Table of Contents

Using Data to Analyse Elite Runners and their Performance through History

1. Introduction

1.1. Aim

The aim of this project is to investigate how the performance of elite runners is linked to factors like height, weight and age. This will be done by analysing historical data about athlete performances through history, mainly in modern Olympic running events. Understanding the relationship between these characteristics and athlete performance would be extremely helpful in developing training plans and improving athlete performance in the future.

1.2. Specific Questions Addressed

Athlete performance is clearly affected by many factors, and this analysis will be limited to just a few of them, dictated mainly by the data available. The specific questions addressed here are:

  1. How have elite runners' performances changed through history?
  2. How have Olympic runners' characteristics (height, weight, age, body mass index) changed through history?
  3. Is there a relationship between Olympic runners' characteristics and their performance?

There will be no particular modelling or machine learning in this analysis because the questions can be answered by visualising the statistics alone.

In [1]:
# Import libraries
import pandas as pd
import chardet                       # For character encoding
import ftfy                          # For fixing encoding issues
from matplotlib import pyplot as plt
from matplotlib import pylab as plb  # For best fit lines
from datetime import datetime, time, timedelta, date
import numpy as np
from fuzzywuzzy import fuzz          # For inexact ("fuzzy") string matching
from fuzzywuzzy import process
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()     # Future compatibility plotting datetime

2. Data Understanding

2.1. Load Data and Initial Analysis

This analysis will attempt some originality by combining three separate data sets. This allows athlete characteristics to be linked to athlete performances so any relationship between the two can be investigated.

First, load the data sets and briefly examine them.

2.1.1. Olympic Games Results and Athletes, 1896-2016

The first data set is the Olympic Games results and athlete data, 1896-2016.

Source:

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/downloads/120-years-of-olympic-history-athletes-and-results.zip/2

In [2]:
# Results data is in the first file:
all_olympics = pd.read_csv('datasets/athlete_events.csv')
all_olympics.head()
Out[2]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
In [3]:
all_olympics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
ID        271116 non-null int64
Name      271116 non-null object
Sex       271116 non-null object
Age       261642 non-null float64
Height    210945 non-null float64
Weight    208241 non-null float64
Team      271116 non-null object
NOC       271116 non-null object
Games     271116 non-null object
Year      271116 non-null int64
Season    271116 non-null object
City      271116 non-null object
Sport     271116 non-null object
Event     271116 non-null object
Medal     39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB

Summary

The full Olympic Games data set contains useful information over a 120 year period about the competitiors (height, weight, age, country of origin, and medal, if they won one).

2.1.2. Olympic track and field times and results

The second data set contains the Olympic track and field times and results. Note it only includes data for medal winners. Source:

https://www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1

In [4]:
# Data set 2 - Olympic track and field times and results. Source:
# www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1
# There is an additional unlabelled column in a few of the rows.
# Therefore, read explicitly labelled columns and discard unlabelled column.
ol_tf = pd.read_csv('datasets/results.csv', names=['Gender',
                                                   'Event',
                                                   'Location',
                                                   'Year',
                                                   'Medal',
                                                   'Name',
                                                   'Nationality',
                                                   'Result'])
ol_tf.drop(index=0, inplace=True)
ol_tf.head()
Out[4]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [5]:
ol_tf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2394 entries, 1 to 2394
Data columns (total 8 columns):
Gender         2394 non-null object
Event          2394 non-null object
Location       2394 non-null object
Year           2394 non-null object
Medal          2394 non-null object
Name           2164 non-null object
Nationality    2394 non-null object
Result         2394 non-null object
dtypes: object(8)
memory usage: 168.3+ KB

Summary

The most useful feature of the track and field results is the detailed running times and event results. This will be linked to the full Olympic data (including its information on the athletes' characteristics) later in the analysis.

2.1.3. Top 1000 Running Performances for all Olympic Distances and the Half Marathon

The third data set contains the top 1000 running performances for each running event.

Source:

https://www.kaggle.com/jguerreiro/running/downloads/running.zip/2

In [6]:
top_running = pd.read_csv('datasets/data.csv')
top_running.head()
Out[6]:
Rank Time Name Country Date of Birth Place City Date Gender Event
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 800 m
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 800 m
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 800 m
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 800 m
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 800 m
In [7]:
top_running.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18244 entries, 0 to 18243
Data columns (total 10 columns):
Rank             18244 non-null int64
Time             18244 non-null object
Name             18244 non-null object
Country          18244 non-null object
Date of Birth    18244 non-null object
Place            18236 non-null float64
City             18244 non-null object
Date             18244 non-null object
Gender           18244 non-null object
Event            18244 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 1.4+ MB

Summary

This data set is good because it contains a large number of data points (1000) including finish times for every running discipline. It is not limited to Olympic performances, but all the events are Olympic distances, with the exception of the half marathon.

3. Data Preparation

3.1. Olympic Results Data

3.1.1. Events

In [8]:
print("Number of unique events is {}"
      .format(len(all_olympics['Event'].unique())))
Number of unique events is 765

765 events is far too many to analyse. It also includes some events which have not taken place in the Olympics for a long time. This analysis is focussed on modern running events, so we will extract a subset of the results.

In [9]:
olympic_sports_groups = all_olympics.groupby('Sport')
athletics = olympic_sports_groups.get_group('Athletics')
all_athletics_events = athletics['Event'].unique()
all_athletics_events
Out[9]:
array(["Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's Shot Put", "Athletics Men's Pole Vault",
       "Athletics Men's High Jump", "Athletics Men's 1,500 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's Long Jump", "Athletics Women's Javelin Throw",
       "Athletics Men's 5,000 metres", "Athletics Women's Shot Put",
       "Athletics Men's 110 metres Hurdles", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Men's 400 metres Hurdles",
       "Athletics Men's 400 metres", "Athletics Men's Hammer Throw",
       "Athletics Men's 800 metres", "Athletics Men's Marathon",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's 10,000 metres", "Athletics Women's 3,000 metres",
       "Athletics Men's 200 metres", "Athletics Men's Javelin Throw",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 200 metres", "Athletics Men's Triple Jump",
       "Athletics Women's Long Jump", "Athletics Women's 5,000 metres",
       "Athletics Men's Discus Throw", "Athletics Women's 10,000 metres",
       "Athletics Men's Decathlon", "Athletics Women's Discus Throw",
       "Athletics Women's 1,500 metres", "Athletics Women's Pole Vault",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 800 metres", "Athletics Women's 400 metres",
       "Athletics Men's Javelin Throw, Both Hands",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's Pentathlon", "Athletics Women's High Jump",
       "Athletics Men's Standing High Jump",
       "Athletics Men's Standing Long Jump",
       "Athletics Men's 20 kilometres Walk",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Men's Discus Throw, Greek Style",
       "Athletics Women's Heptathlon", "Athletics Men's Stone Throw",
       "Athletics Men's Javelin Throw, Freestyle",
       "Athletics Men's Pentathlon (Ancient)",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Men's Shot Put, Both Hands",
       "Athletics Men's 10 kilometres Walk",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's 80 metres Hurdles",
       "Athletics Women's Triple Jump",
       "Athletics Women's 10 kilometres Walk",
       "Athletics Men's Cross-Country, Individual",
       "Athletics Men's Cross-Country, Team",
       "Athletics Men's 3,000 metres, Team", "Athletics Men's Pentathlon",
       "Athletics Men's 3,000 metres Walk", "Athletics Men's 5 mile",
       "Athletics Women's Hammer Throw",
       "Athletics Men's 3,200 metres Steeplechase",
       "Athletics Men's Standing Triple Jump",
       "Athletics Men's 4,000 metres Steeplechase",
       "Athletics Men's 5,000 metres, Team",
       "Athletics Men's 1,600 metres Medley Relay",
       "Athletics Men's 60 metres", "Athletics Men's 1,500 metres Walk",
       "Athletics Men's 3 mile, Team",
       "Athletics Men's 3,500 metres Walk",
       "Athletics Men's 10 mile Walk",
       "Athletics Men's Discus Throw, Both Hands",
       "Athletics Men's 200 metres Hurdles",
       "Athletics Men's 56-pound Weight Throw",
       "Athletics Men's 2,500 metres Steeplechase",
       "Athletics Men's All-Around Championship",
       "Athletics Men's 2,590 metres Steeplechase",
       "Athletics Men's 4 mile, Team"], dtype=object)

This is a more manageable list of events. There are still some events here that don't exist in the modern Games. The next step is to remove any events that didn't take place in the most recent summer Games (2016).

In [10]:
modern_athletics_events = athletics[
    athletics['Year'] == 2016]['Event'].unique()
modern_athletics_events
Out[10]:
array(["Athletics Men's 5,000 metres", "Athletics Men's 400 metres",
       "Athletics Men's 10,000 metres", "Athletics Women's 200 metres",
       "Athletics Men's Decathlon", "Athletics Men's Marathon",
       "Athletics Women's Shot Put", "Athletics Women's 400 metres",
       "Athletics Men's Shot Put", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's 200 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's High Jump", "Athletics Men's Triple Jump",
       "Athletics Women's Heptathlon", "Athletics Women's Javelin Throw",
       "Athletics Women's Pole Vault",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 800 metres", "Athletics Women's 1,500 metres",
       "Athletics Men's Discus Throw", "Athletics Men's 1,500 metres",
       "Athletics Men's 400 metres Hurdles",
       "Athletics Women's Long Jump",
       "Athletics Men's 110 metres Hurdles",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Women's 5,000 metres",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's Long Jump", "Athletics Men's 800 metres",
       "Athletics Women's High Jump", "Athletics Men's Javelin Throw",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's Discus Throw", "Athletics Men's Hammer Throw",
       "Athletics Women's 10,000 metres",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Men's 20 kilometres Walk", "Athletics Men's Pole Vault",
       "Athletics Women's Triple Jump", "Athletics Women's Hammer Throw"],
      dtype=object)
In [11]:
removed_events = set(all_athletics_events).difference(modern_athletics_events)
removed_events
Out[11]:
{"Athletics Men's 1,500 metres Walk",
 "Athletics Men's 1,600 metres Medley Relay",
 "Athletics Men's 10 kilometres Walk",
 "Athletics Men's 10 mile Walk",
 "Athletics Men's 2,500 metres Steeplechase",
 "Athletics Men's 2,590 metres Steeplechase",
 "Athletics Men's 200 metres Hurdles",
 "Athletics Men's 3 mile, Team",
 "Athletics Men's 3,000 metres Walk",
 "Athletics Men's 3,000 metres, Team",
 "Athletics Men's 3,200 metres Steeplechase",
 "Athletics Men's 3,500 metres Walk",
 "Athletics Men's 4 mile, Team",
 "Athletics Men's 4,000 metres Steeplechase",
 "Athletics Men's 5 mile",
 "Athletics Men's 5,000 metres, Team",
 "Athletics Men's 56-pound Weight Throw",
 "Athletics Men's 60 metres",
 "Athletics Men's All-Around Championship",
 "Athletics Men's Cross-Country, Individual",
 "Athletics Men's Cross-Country, Team",
 "Athletics Men's Discus Throw, Both Hands",
 "Athletics Men's Discus Throw, Greek Style",
 "Athletics Men's Javelin Throw, Both Hands",
 "Athletics Men's Javelin Throw, Freestyle",
 "Athletics Men's Pentathlon",
 "Athletics Men's Pentathlon (Ancient)",
 "Athletics Men's Shot Put, Both Hands",
 "Athletics Men's Standing High Jump",
 "Athletics Men's Standing Long Jump",
 "Athletics Men's Standing Triple Jump",
 "Athletics Men's Stone Throw",
 "Athletics Women's 10 kilometres Walk",
 "Athletics Women's 3,000 metres",
 "Athletics Women's 80 metres Hurdles",
 "Athletics Women's Pentathlon"}
In [12]:
indices_to_remove = [athletics.index[i] for i in range(len(athletics)) if
                     athletics['Event'].iloc[i] in removed_events]
modern_athletics = athletics.drop(index=indices_to_remove)
modern_athletics['Event'].unique()
Out[12]:
array(["Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's Shot Put", "Athletics Men's Pole Vault",
       "Athletics Men's High Jump", "Athletics Men's 1,500 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's Long Jump", "Athletics Women's Javelin Throw",
       "Athletics Men's 5,000 metres", "Athletics Women's Shot Put",
       "Athletics Men's 110 metres Hurdles", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Men's 400 metres Hurdles",
       "Athletics Men's 400 metres", "Athletics Men's Hammer Throw",
       "Athletics Men's 800 metres", "Athletics Men's Marathon",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's 10,000 metres", "Athletics Men's 200 metres",
       "Athletics Men's Javelin Throw",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 200 metres", "Athletics Men's Triple Jump",
       "Athletics Women's Long Jump", "Athletics Women's 5,000 metres",
       "Athletics Men's Discus Throw", "Athletics Women's 10,000 metres",
       "Athletics Men's Decathlon", "Athletics Women's Discus Throw",
       "Athletics Women's 1,500 metres", "Athletics Women's Pole Vault",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 800 metres", "Athletics Women's 400 metres",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's High Jump",
       "Athletics Men's 20 kilometres Walk",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Women's Heptathlon",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's Triple Jump", "Athletics Women's Hammer Throw"],
      dtype=object)

This analysis will focus on individual running events. So, now remove the field events and non-running events.

In [13]:
# These are the events to keep for the analysis.
modern_individual_running_events = {
    "Athletics Women's 100 metres",
    "Athletics Men's 1,500 metres",
    "Athletics Men's 5,000 metres",
    "Athletics Men's 110 metres Hurdles",
    "Athletics Women's Marathon",
    "Athletics Men's 100 metres",
    "Athletics Men's 400 metres Hurdles",
    "Athletics Men's 400 metres",
    "Athletics Men's 800 metres",
    "Athletics Men's Marathon",
    "Athletics Men's 10,000 metres",
    "Athletics Men's 200 metres",
    "Athletics Men's 3,000 metres Steeplechase",
    "Athletics Women's 200 metres",
    "Athletics Women's 5,000 metres",
    "Athletics Women's 10,000 metres",
    "Athletics Women's 1,500 metres",
    "Athletics Women's 800 metres",
    "Athletics Women's 400 metres",
    "Athletics Women's 400 metres Hurdles",
    "Athletics Women's 100 metres Hurdles",
    "Athletics Women's 3,000 metres Steeplechase"}
In [14]:
removed_events = set(modern_athletics_events).difference(
    modern_individual_running_events)
removed_events
Out[14]:
{"Athletics Men's 20 kilometres Walk",
 "Athletics Men's 4 x 100 metres Relay",
 "Athletics Men's 4 x 400 metres Relay",
 "Athletics Men's 50 kilometres Walk",
 "Athletics Men's Decathlon",
 "Athletics Men's Discus Throw",
 "Athletics Men's Hammer Throw",
 "Athletics Men's High Jump",
 "Athletics Men's Javelin Throw",
 "Athletics Men's Long Jump",
 "Athletics Men's Pole Vault",
 "Athletics Men's Shot Put",
 "Athletics Men's Triple Jump",
 "Athletics Women's 20 kilometres Walk",
 "Athletics Women's 4 x 100 metres Relay",
 "Athletics Women's 4 x 400 metres Relay",
 "Athletics Women's Discus Throw",
 "Athletics Women's Hammer Throw",
 "Athletics Women's Heptathlon",
 "Athletics Women's High Jump",
 "Athletics Women's Javelin Throw",
 "Athletics Women's Long Jump",
 "Athletics Women's Pole Vault",
 "Athletics Women's Shot Put",
 "Athletics Women's Triple Jump"}
In [15]:
indices_to_remove = [modern_athletics.index[i]
                     for i in range(len(modern_athletics)) if
                     modern_athletics['Event'].iloc[i] in removed_events]
ol_running = modern_athletics.drop(index=indices_to_remove)
ol_running.head()
Out[15]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics Athletics Women's 100 metres NaN
98 34 Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics Athletics Men's 1,500 metres NaN
148 55 Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics Athletics Men's 5,000 metres NaN
190 86 Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics Athletics Men's 1,500 metres NaN
191 86 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics Athletics Men's 1,500 metres Bronze

3.1.2. Missing Values

In [16]:
# Check for missing values in each column.
ol_running.isnull().sum()
Out[16]:
ID            0
Name          0
Sex           0
Age         667
Height     2987
Weight     3131
Team          0
NOC           0
Games         0
Year          0
Season        0
City          0
Sport         0
Event         0
Medal     17951
dtype: int64

Many rows have no entry for a medal, and this is expected - many competitors do not win a medal, so there is no special treatment needed for missing values in the Medal feature. There are also a lot of missing values for height, weight and age, these will be examined now.

In [17]:
age_missing = ol_running[ol_running['Age'].isnull()]
weight_missing = ol_running[ol_running['Weight'].isnull()]
height_missing = ol_running[ol_running['Height'].isnull()]
In [18]:
age_missing.head()
Out[18]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
327 179 Ibrahim Saad Abdel Galil M NaN 176.0 73.0 Sudan SUD 1972 Summer 1972 Summer Munich Athletics Athletics Men's 200 metres NaN
353 194 Moustafa Mounib Abdel Kader M NaN 176.0 67.0 United Arab Republic UAR 1960 Summer 1960 Summer Roma Athletics Athletics Men's 100 metres NaN
505 281 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Athletics Athletics Men's 110 metres Hurdles NaN
506 281 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Athletics Athletics Men's 400 metres Hurdles NaN
In [19]:
weight_missing.head()
Out[19]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics Athletics Women's 100 metres NaN
215 104 Gana Abba Kimet M 26.0 NaN NaN Chad CHA 1972 Summer 1972 Summer Munich Athletics Athletics Men's 100 metres NaN
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
322 175 Abdelgani Hassan Abdel Fattah M 31.0 NaN NaN Egypt EGY 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's Marathon NaN
429 237 Ben Ahmed Abdelkrim M 20.0 NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 5,000 metres NaN
In [20]:
height_missing.head()
Out[20]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
215 104 Gana Abba Kimet M 26.0 NaN NaN Chad CHA 1972 Summer 1972 Summer Munich Athletics Athletics Men's 100 metres NaN
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
322 175 Abdelgani Hassan Abdel Fattah M 31.0 NaN NaN Egypt EGY 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's Marathon NaN
429 237 Ben Ahmed Abdelkrim M 20.0 NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 5,000 metres NaN
488 268 Brahim Daoud Abdoulaye M 25.0 NaN 71.0 Chad CHA 1996 Summer 1996 Summer Atlanta Athletics Athletics Men's 200 metres NaN

We now have three groups of rows that have at least one missing value. Now find out if they overlap by using sets:

In [21]:
age_missing_indices = set(age_missing.index)
weight_missing_indices = set(weight_missing.index)
height_missing_indices = set(height_missing.index)
print("The number of rows where both height and weight are missing is {}"
      .format(len(weight_missing_indices.intersection(
                                              height_missing_indices))))
print("The number of rows where both age and weight are missing is {}"
      .format(len(age_missing_indices.intersection(weight_missing_indices))))
print("The number of rows where both age and height are missing is {}"
      .format(len(age_missing_indices.intersection(height_missing_indices))))
print("The number of rows where age, height and weight are missing is {}"
      .format(len(age_missing_indices.intersection(
                                              height_missing_indices,
                                              weight_missing_indices))))
The number of rows where both height and weight are missing is 2961
The number of rows where both age and weight are missing is 537
The number of rows where both age and height are missing is 504
The number of rows where age, height and weight are missing is 504

Of the rows where either height (2987) or weight (3131) are missing, most (2961) of them are missing both height and weight. Of the rows where age is missing (667), most (at least 504) are also missing either weight, height or both. The overlap between the missing data sets is large, which is good news, because it means more of the rows are fully populated, so more of this data is usable without dropping data or imputation. For now, all the data will be kept (not dropping rows with missing data).

3.1.3. Event Labels

The Event feature is a categorical variable. This will be encoded as follows:

  • Strip off the part of the string related to sport (== 'Athletics') - this is not useful anymore.
  • Strip off the gender ("Men's" or "Women's" - because this is already encoded in a separate feature
  • Define a feature 'Track_Flat'. This will cover all the flat (i.e. non-hurdles) track events. It will contain the integer distance of the race in metres, i.e. the 100 m will have a 'Track_Flat' value of 100. Events that are not flat track type will have a 'Track_Flat' value of 0.
  • Define a feature 'Hurdles'. This will be similar to the 'Track_Flat' feature, and will contain the integer distance of the race, or 0 for non-hurdles events.
  • Define a feature 'Road'. This will be the race distance in metrs (42195 m for a marathon, 21098 for a half marathon)
  • Define a feature 'Steeplechase'. This will be 3000 for steeplechase, 0 otherwise.

This method of encoding is chosen because it groups together similar types of events (e.g., hurdles events are treated as a group, flat track events are treated as a separate group) and also separates them by the distance of each event (100m, 200m, etc.)

In [22]:
# Simple string processing in Event column
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Women's ", "")
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Men's ", "")
ol_running['Event'] = ol_running['Event'].str.replace(" metres", "")
ol_running['Event'] = ol_running['Event'].str.replace(",", "")
In [23]:
ol_running.head()
Out[23]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 100 NaN
98 34 Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 1500 NaN
148 55 Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 5000 NaN
190 86 Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 1500 NaN
191 86 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 1500 Bronze

Adding the new columns, copying the values between columns and removing duplicates is repetetive so write a function for this:

In [24]:
def encode_events(df, col, to_replace, replacement):
    """
    Helper function to insert new columns,
    copy and convert values to the correct column
    """
    # Insert new column
    df.insert(df.columns.get_loc('Event'), col, 0)
    # Copy values across to new column
    df.loc[df['Event'].str.contains(to_replace), col] = df[
        'Event'].str.replace(to_replace, replacement)
    # Remove values from original column
    df.loc[df[col] != 0, 'Event'] = '0'
In [25]:
def string_to_int(df, features):
    """
    Helper function to cast string values to integers.
    """
    for feature in features:
        df[feature] = pd.to_numeric(df[feature], downcast='integer')
In [26]:
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = [' Hurdles', 'Marathon', ' Steeplechase']
replacement = ['', '42195', '']

for i in range(len(new_columns)):
    encode_events(ol_running, new_columns[i], to_replace[i], replacement[i])

ol_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)

3.1.4. Other Processing

Cast Strings to Integers
In [27]:
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_running, columns_to_int)
Sex

The other two data sets refer to this as 'Gender'. For ease of comparison, change the name of this feature from 'Sex' to ''Gender'.

In [28]:
ol_running.rename(columns={'Sex': 'Gender'}, inplace=True)
Medal

For ease of comparison with the other data sets, convert 'Gold' to 'G', 'Silver' to 'S', and 'Bronze to 'B'

In [29]:
medals = ['Gold', 'Silver', 'Bronze']
short_medals = ['G', 'S', 'B']

for i in range(len(medals)):
    ol_running.loc[
        ol_running['Medal'] == medals[i], 'Medal'] = ol_running[
        ol_running['Medal'] == medals[i]]['Medal'].str.replace(
                                                    medals[i],
                                                    short_medals[i])
Name

The 'Name' column also looks difficult to use:

In [30]:
ol_running['Name'].head()
Out[30]:
26     Cornelia "Cor" Aalten (-Strannood)
98     Jamale (Djamel-) Aarrass (Ahrass-)
148                   Antonio Abadia Beci
190               Jos Manuel Abascal Gmez
191               Jos Manuel Abascal Gmez
Name: Name, dtype: object

There are alternative names/nicknames in parentheses and double quotes. The intention is to use the names later on, so to make this easier, remove sections in parentheses and double quotes, and convert the name string to lowercase. Make this a function so it can be used on the other data sets later on.

In [31]:
def process_names(df):
    """
    Helper function to perform some cleaning on the athlete Name field.
    """
    df.rename(columns={'Name': 'RawName'}, inplace=True)
    df.insert(loc=df.columns.get_loc('RawName'), column='Name', value=np.NaN)
    df['Name'] = df['RawName'].str.replace('\"(.*?)\"', '')
    df['Name'] = df['Name'].str.replace('\((.*?)\)', '')
    df['Name'] = df['Name'].str.lower()
In [32]:
process_names(ol_running)
In [33]:
ol_running.head()
Out[33]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
26 8 cornelia aalten Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 0 0 0 100 NaN
98 34 jamale aarrass Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 0 0 0 1500 NaN
148 55 antonio abadia beci Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN
190 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 0 0 0 1500 NaN
191 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 0 0 0 1500 B

Wrangling of this data set is complete, and from here on the cleaned data frame will always be called ol_running.

3.2. Olympic Track and Field Data

3.2.1. Events

In [34]:
print("Number of unique events is {}".format(len(ol_tf['Event'].unique())))
Number of unique events is 47
In [35]:
ol_tf.head()
Out[35]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [36]:
all_ol_tf_events = ol_tf['Event'].unique()
all_ol_tf_events
Out[36]:
array(['10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
       '200M Men', '20Km Race Walk Men', '3000M Steeplechase Men',
       '400M Hurdles Men', '400M Men', '4X100M Relay Men',
       '4X400M Relay Men', '5000M Men', '50Km Race Walk Men', '800M Men',
       'Decathlon Men', 'Discus Throw Men', 'Hammer Throw Men',
       'High Jump Men', 'Javelin Throw Men', 'Long Jump Men',
       'Marathon Men', 'Pole Vault Men', 'Shot Put Men',
       'Triple Jump Men', '10000M Women', '100M Hurdles Women',
       '100M Women', '1500M Women', '200M Women', '20Km Race Walk Women',
       '3000M Steeplechase Women', '400M Hurdles Women', '400M Women',
       '4X100M Relay Women', '4X400M Relay Women', '5000M Women',
       '800M Women', 'Discus Throw Women', 'Hammer Throw Women',
       'Heptathlon Women', 'High Jump Women', 'Javelin Throw Women',
       'Long Jump Women', 'Marathon Women', 'Pole Vault Women',
       'Shot Put Women', 'Triple Jump Women'], dtype=object)

As in the previous section, this analysis will keep the individual running events and drop the remainder.

In [37]:
ol_tf_running_events = {
                        '10000M Men',
                        '100M Men',
                        '110M Hurdles Men',
                        '1500M Men',
                        '200M Men',
                        '3000M Steeplechase Men',
                        '400M Hurdles Men',
                        '400M Men',
                        '5000M Men',
                        '800M Men',
                        'Marathon Men',
                        '10000M Women',
                        '100M Hurdles Women',
                        '100M Women',
                        '1500M Women',
                        '200M Women',
                        '3000M Steeplechase Women',
                        '400M Hurdles Women',
                        '400M Women',
                        '5000M Women',
                        '800M Women',
                        'Marathon Women'}
In [38]:
indices_to_remove = [ol_tf.index[i] for i in range(len(ol_tf))
                     if not ol_tf['Event'].iloc[i] in ol_tf_running_events]
ol_tf_running = ol_tf.drop(index=indices_to_remove)
ol_tf_running.head()
Out[38]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [39]:
ol_tf_running['Event'].unique()
Out[39]:
array(['10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
       '200M Men', '3000M Steeplechase Men', '400M Hurdles Men',
       '400M Men', '5000M Men', '800M Men', 'Marathon Men',
       '10000M Women', '100M Hurdles Women', '100M Women', '1500M Women',
       '200M Women', '3000M Steeplechase Women', '400M Hurdles Women',
       '400M Women', '5000M Women', '800M Women', 'Marathon Women'],
      dtype=object)

In addition, this data set contains results for a Men's 3000m steeplechase in 1900 and 1904. However, this is an error in the data - the 1900 and 1904 Olympics featured shorter steeplechase ditances (source: https://en.wikipedia.org/wiki/Steeplechase_(athletics)). Therefore the rows for 3000M Steeplechase Men for 1900 and 1904 will be removed.

In [40]:
drop_steeplechase = ol_tf_running[
    ((ol_tf_running['Year'] == '1900') |
     (ol_tf_running['Year'] == '1904')) &
    (ol_tf_running['Event'] == '3000M Steeplechase Men')].index.tolist()
ol_tf_running.drop(index=drop_steeplechase, inplace=True)

This now contains the data of interest.

3.2.2. Missing Values

In [41]:
# Check for missing values in each column.
ol_tf_running.isnull().sum()
Out[41]:
Gender         0
Event          0
Location       0
Year           0
Medal          0
Name           0
Nationality    0
Result         0
dtype: int64

No missing values are shown but this is deceptive, since some of the 'Result' fields conatin the string 'None'.

In [42]:
ol_tf_running[ol_tf_running['Result'] == 'None'].head()
Out[42]:
Gender Event Location Year Medal Name Nationality Result
107 M 100M Men London 1908 S John RECTOR USA None
108 M 100M Men London 1908 B Robert KERR CAN None
190 M 110M Hurdles Men London 1908 B Arthur SHAW USA None
300 M 1500M Men Los Angeles 1932 B Philip EDWARDS CAN None
309 M 1500M Men St Louis 1904 B Lacey HEARN USA None
In [43]:
ol_tf_running.loc[ol_tf_running['Result'] == 'None', 'Result'] = pd.NaT
ol_tf_running.dropna(subset=['Result'], inplace=True)

3.2.3. Event Labels

The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.

In [44]:
# Simple string processing in Event column
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Women", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Men", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("M ", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace(",", "")

new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = ['Hurdles', 'Marathon', 'Steeplechase']
replacement = ['', '42195', '']

for i in range(len(new_columns)):
    encode_events(ol_tf_running, new_columns[i], to_replace[i], replacement[i])

ol_tf_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)

3.2.4. Results

The aim is to convert the Results string to a datetime object, extract the time from this and store it in a feature called 'Time'. The time formats vary a lot in this data set so some cleaning is needed.

It's possible to create general groups of events that share similar formats.

In [45]:
# Hurdle events
ol_tf_running_hurdles_groups = ol_tf_running.groupby('Hurdles')
# Road running events
ol_tf_running_road_groups = ol_tf_running.groupby('Road')
# Steeplechase
ol_tf_running_steeplechase_groups = ol_tf_running.groupby('Steeplechase')
# Track (flat) events
ol_tf_running_trackf_groups = ol_tf_running.groupby('Track_Flat')

event_groups = [ol_tf_running_hurdles_groups,
                ol_tf_running_road_groups,
                ol_tf_running_steeplechase_groups,
                ol_tf_running_trackf_groups]

for group in event_groups:
    # Ignore the first event in each category where distance=0
    for event in list(group.groups.keys())[1:]:
        print("Event: {}".format(event))
        print(group.get_group(event)['Result'].head(3))
Event: 100 
1654    12.48
1655    12.59
1656    12.61
Name: Result, dtype: object
Event: 110 
152    13.05
153    13.17
154    13.24
Name: Result, dtype: object
Event: 400 
511    47.73
512    47.78
513    47.92
Name: Result, dtype: object
Event: 42195 
1376    02:08:44
1377     2:09:54
1378     2:10:05
Name: Result, dtype: object
Event: 3000 
439    8:03.28
440    8:04.28
441    8:11.52
Name: Result, dtype: object
Event: 100
70    9.81
71    9.89
72    9.91
Name: Result, dtype: object
Event: 10000
1    25:05.17
2    27:05.64
3    27:06.26
Name: Result, dtype: object
Event: 1500
232    3:50.00
233    3:50.11
234    3:50.24
Name: Result, dtype: object
Event: 200
313    19.78
314    20.02
315    20.12
Name: Result, dtype: object
Event: 400
586    43.03
587    43.76
588    43.85
Name: Result, dtype: object
Event: 5000
804    13:03.30
805    13:03.90
806    13:04.35
Name: Result, dtype: object
Event: 800
930    1:42.15
931    1:42.61
932    1:42.93
Name: Result, dtype: object

This shows it's possible to define three time formats in this result set:

In [46]:
# Time format for the sprint events
time_format_sprints = '%S.%f'

# Time format for middle distance events
time_format_middle = '%M:%S.%f'

# Time format for long distance events
time_format_long = '%H:%M:%S'

Examining each event in more detail shows that some further processing is needed.

Steeplechase

In [47]:
ol_tf_running_steeplechase_groups.get_group('3000 ')['Result'].head()
Out[47]:
439    8:03.28
440    8:04.28
441    8:11.52
442    8:10.34
443    8:10.49
Name: Result, dtype: object
In [48]:
# Convert to datetime and extract the time part only.
ol_tf_running.loc[
    ol_tf_running['Steeplechase'] == '3000 ', 'Time'] = pd.to_datetime(
    ol_tf_running[ol_tf_running['Steeplechase'] == '3000 ']['Result'],
    format=time_format_middle).apply(datetime.time)

Hurdles

In [49]:
ol_tf_running_hurdles_groups.get_group('100 ')['Result'].head()
Out[49]:
1654    12.48
1655    12.59
1656    12.61
1657    12.54
1658    12.64
Name: Result, dtype: object
In [50]:
ol_tf_running_hurdles_groups.get_group('110 ')['Result'].head()
Out[50]:
152    13.05
153    13.17
154    13.24
155    12.93
156    13.17
Name: Result, dtype: object
In [51]:
ol_tf_running_hurdles_groups.get_group('400 ')['Result'].head()
Out[51]:
511    47.73
512    47.78
513    47.92
514    47.25
515    47.98
Name: Result, dtype: object

In addition, some of the time strings have a leading '0:':

In [52]:
ol_tf_running[ol_tf_running['Result'] == '0:54.0']
Out[52]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
544 M 400 0 0 0 Antwerp 1920 G Frank LOOMIS USA 0:54.0 NaN
In [53]:
# Remove leading '0:':
ol_tf_running.loc[
    ol_tf_running['Hurdles'] == '400 ', 'Result'] = ol_tf_running[
    ol_tf_running['Hurdles'] == '400 ']['Result'].str.replace('0:', '')
In [54]:
# For all Hurdles distances convert to datetime and extract time part.
events = list(ol_tf_running_hurdles_groups.groups.keys())
# Ignore the fist event in each category where distance=0
events.remove(0)
for event in events:
    ol_tf_running.loc[
        ol_tf_running['Hurdles'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Hurdles'] == event]['Result'],
        format=time_format_sprints).apply(datetime.time)

Track (Flat)

In [55]:
ol_tf_running_trackf_groups.get_group('100')['Result'].head()
Out[55]:
70    9.81
71    9.89
72    9.91
73    9.69
74    9.89
Name: Result, dtype: object
In [56]:
ol_tf_running_trackf_groups.get_group('200')['Result'].head()
Out[56]:
313    19.78
314    20.02
315    20.12
316    19.30
317    19.96
Name: Result, dtype: object
In [57]:
ol_tf_running_trackf_groups.get_group('400')['Result'].head()
Out[57]:
586    43.03
587    43.76
588    43.85
589    43.75
590    44.74
Name: Result, dtype: object
In [58]:
ol_tf_running_trackf_groups.get_group('800')['Result'].head()
Out[58]:
930    1:42.15
931    1:42.61
932    1:42.93
933    1:44.65
934    1:44.70
Name: Result, dtype: object
In [59]:
ol_tf_running_trackf_groups.get_group('1500')['Result'].head()
Out[59]:
232    3:50.00
233    3:50.11
234    3:50.24
235    3:33.11
236    3:34.16
Name: Result, dtype: object
In [60]:
ol_tf_running_trackf_groups.get_group('5000')['Result'].head()
Out[60]:
804    13:03.30
805    13:03.90
806    13:04.35
807    12:57.82
808    13:02.80
Name: Result, dtype: object
In [61]:
ol_tf_running_trackf_groups.get_group('10000')['Result'].head()
Out[61]:
1    25:05.17
2    27:05.64
3    27:06.26
4    27:01.17
5    27:02.77
Name: Result, dtype: object

Track events for distances less than 800m all have times written in the format defined in time_format_sprints. 800m and above use the format defined in time_format_middle.

In [62]:
sprint_distances = ['100', '200', '400']
middle_distances = ['800', '1500', '5000', '10000']

As with the hurdles distances above, remove any leading '0:':

In [63]:
# Remove leading '0:':
for event in sprint_distances:
    ol_tf_running.loc[
        ol_tf_running['Track_Flat'] == event, 'Result'] = ol_tf_running[
        ol_tf_running['Track_Flat'] == event]['Result'].str.replace('0:', '')
In [64]:
# For track sprint events convert to datetime and extract time part only.
for event in sprint_distances:
    ol_tf_running.loc[
        ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
        format=time_format_sprints).apply(datetime.time)
In [65]:
# For track middle distance events convert to datetime and extract time.
for event in middle_distances:
    ol_tf_running.loc[
        ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'],
        format=time_format_middle).apply(datetime.time)

Road

In [66]:
ol_tf_running_road_groups.get_group('42195 ').head()
Out[66]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1376 M 0 42195 0 0 Rio 2016 G Eliud Kipchoge ROTICH KEN 02:08:44 NaN
1377 M 0 42195 0 0 Rio 2016 S Feyisa LILESA ETH 2:09:54 NaN
1378 M 0 42195 0 0 Rio 2016 B Galen RUPP USA 2:10:05 NaN
1379 M 0 42195 0 0 Beijing 2008 G Samuel Kamau WANJIRU KEN 2h06:32 NaN
1380 M 0 42195 0 0 Beijing 2008 S Jaouad GHARIB MAR 2h07:16 NaN

Some specific examples show there are several problems:

In [67]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1379]]
Out[67]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1379 M 0 42195 0 0 Beijing 2008 G Samuel Kamau WANJIRU KEN 2h06:32 NaN
In [68]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1392]]
Out[68]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1392 M 0 42195 0 0 Montreal 1976 S Frank Charles SHORTER USA 2:10:45.8 NaN
In [69]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1417]]
Out[69]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1417 M 0 42195 0 0 Paris 1900 B Ernst FAST SWE 3-37:14.0 NaN

This shows several formatting problems:

  • some of the values for marathon times include an 'h' for hours instead of a ':' Remove these and replace with a colon:
  • sometimes there is a milliseconds value, sometimes not
  • sometimes '-' is used instead of ':'

Taking these in turn:

In [70]:
# Remove 'h'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
    ol_tf_running['Road'] == '42195 ']['Result'].str.replace('h', ':')
In [71]:
# Remove milliseconds:
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
    ol_tf_running['Road'] == '42195 ']['Result'].str.replace('\..*', '')
In [72]:
# Replace '-' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[
    ol_tf_running['Road'] == '42195 ']['Result'].str.replace('-', ':')
In [73]:
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].head()
Out[73]:
1376    02:08:44
1377     2:09:54
1378     2:10:05
1379     2:06:32
1380     2:07:16
Name: Result, dtype: object

There are also some values that only include hours and minutes:

In [74]:
ol_tf_running[ol_tf_running['Result'] == '2:32']
Out[74]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1406 M 0 42195 0 0 Amsterdam 1928 G Boughèra EL OUAFI FRA 2:32 NaN
In [75]:
for i in ol_tf_running[ol_tf_running['Road'] == '42195 '].index:
    if len(ol_tf_running['Result'].loc[i].split(':')) < 3:
        ol_tf_running['Result'].loc[i] = ol_tf_running['Result'].loc[i] + ':00'
In [76]:
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Time'] = pd.to_datetime(
    ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'],
    format=time_format_long).apply(datetime.time)

3.2.5. Other Processing

Cast Strings to Integers
In [77]:
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_tf_running, columns_to_int)
Name

Some names include nicknames in double quotes. There is also a string encoding problem causing some characters to be displayed wrongly. For example, 'Emil ZÁTOPEK' and 'Katrin DÖRRE' below:

In [78]:
ol_tf_running['Name'].loc[25]
Out[78]:
'Emil ZÃ\x81TOPEK'
In [79]:
ol_tf_running['Name'].loc[2322]
Out[79]:
'Katrin DÃ\x96RRE'
In [80]:
# Check encoding of the file
with open("datasets/results.csv", 'rb') as file:
    print(chardet.detect(file.read()))
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So chardet still suggests the file is utf-8 encoded. So we can try to clean this up by using the ftfy package to fix the bad encodings (Reference for ftfy: https://ftfy.readthedocs.io/en/latest/)

In [81]:
ol_tf_running['Name'] = ol_tf_running['Name'].apply(ftfy.fix_encoding)
In [82]:
ol_tf_running['Name'].loc[25]
Out[82]:
'Emil ZÁTOPEK'
In [83]:
ol_tf_running['Name'].loc[2322]
Out[83]:
'Katrin DÖRRE'

This shows the bad encodings have disappeared:

  • 'Emil ZÁTOPEK' -> 'Emil ZÁTOPEK'
  • 'Katrin DÖRRE' -> 'Katrin DÖRRE'

Other name text processing is the same as the previous section

In [84]:
# Use the processing function defined previously
process_names(ol_tf_running)
In [85]:
ol_tf_running['Name'].head()
Out[85]:
1            mohamed farah
2    paul kipngetich tanui
3             tamirat tola
4          kenenisa bekele
5           sileshi sihine
Name: Name, dtype: object
Gender

Female athletes are categorised as 'W' in the 'Gender' column. Change this to be 'F' for consistency with the other data sets.

In [86]:
ol_tf_running['Gender'] = ol_tf_running['Gender'].str.replace('W', 'F')

This concludes cleaning of the second data set, which will be named ol_tf_running from here on.

3.3. Top 1000 Running Times

3.3.1. Events

Select individual running events as with the previous two data sets.

In [87]:
print("Number of unique events is {}"
      .format(len(top_running['Event'].unique())))
top_running['Event'].unique()
Number of unique events is 9
Out[87]:
array(['800 m', '400 m', '10,000 m', 'Marathon', '100 m', 'Half marathon',
       '200 m', '1500 m', '5000 m'], dtype=object)

These are all valid events for this analysis. No need to remove any.

3.3.2. Missing Values

In [88]:
top_running.isnull().sum()
Out[88]:
Rank             0
Time             0
Name             0
Country          0
Date of Birth    0
Place            8
City             0
Date             0
Gender           0
Event            0
dtype: int64

There are a few missing 'Place' values. This anlysis will not use this feature and it will not be included any further analysis anyway. No further action on this for now.

In [89]:
top_running.head()
Out[89]:
Rank Time Name Country Date of Birth Place City Date Gender Event
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 800 m
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 800 m
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 800 m
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 800 m
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 800 m

3.3.3. Event Labels

The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.

In [90]:
# Simple string processing in Event column

# Replace strings ('Marathon', 'Half marathon') with distance in metres:
racetype = ['Marathon', 'Half marathon']
distance = ['42195 Road', '21098 Road']

for i in range(len(racetype)):
    top_running.loc[
        top_running['Event'] == racetype[i], 'Event'] = top_running[
        top_running['Event'] == racetype[i]]['Event'].str.replace(
                                                        racetype[i],
                                                        distance[i])

top_running['Event'] = top_running['Event'].str.replace(",", "")
new_columns = ['Road']
to_replace = [' Road']
replacement = ['']

for i in range(len(new_columns)):
    encode_events(top_running, new_columns[i], to_replace[i], replacement[i])

top_running['Event'] = top_running['Event'].str.replace(" m", "")

top_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
In [91]:
top_running.head()
Out[91]:
Rank Time Name Country Date of Birth Place City Date Gender Road Track_Flat
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 0 800
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 0 800
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 0 800
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 0 800
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 0 800

3.3.4. Other Processing

Date
In [92]:
# In column 'Date', year will be used as one of the keys to merge data sets.
# Therefore, create a separate 'Year' column and populate it.
top_running.insert(top_running.columns.get_loc('Date'), 'Year', 0)
top_running['Year'] = top_running['Date'].str.split("-", expand=True)[0]
Cast Strings to Integers
In [93]:
# Convert strings in several features to integers now.
columns_to_int = ['Track_Flat', 'Road', 'Year']
string_to_int(top_running, columns_to_int)
Time

Convert the string into a datetime object. First look at what the different time formats used in each event are:

In [94]:
# Road running events
top_running_road_groups = top_running.groupby('Road')
# Track (flat) events
top_running_trackf_groups = top_running.groupby('Track_Flat')

event_groups = [top_running_road_groups,
                top_running_trackf_groups]

for group in event_groups:
    # Ignore the first event in each category where distance=0
    for event in list(group.groups.keys())[1:]:
        print("Event: {}".format(event))
        print(group.get_group(event)['Time'].head(3))
Event: 21098
8182    00:58:23
8183    00:58:31
8184    00:58:33
Name: Time, dtype: object
Event: 42195
3001    02:02:57
3002    02:03:02
3003    02:03:03
Name: Time, dtype: object
Event: 100
4000    00:00:10.490000
4001    00:00:10.610000
4002    00:00:10.620000
Name: Time, dtype: object
Event: 200
9191    00:00:21.340000
9192    00:00:21.560000
9193    00:00:21.620000
Name: Time, dtype: object
Event: 400
1000    00:00:47.600000
1001    00:00:47.990000
1002    00:00:48.160000
Name: Time, dtype: object
Event: 800
0    00:01:40.910000
1    00:01:41.010000
2    00:01:41.090000
Name: Time, dtype: object
Event: 1500
11230    00:03:50.070000
11231    00:03:50.460000
11232    00:03:50.980000
Name: Time, dtype: object
Event: 5000
12232    00:12:37.350000
12233    00:12:39.360000
12234    00:12:39.740000
Name: Time, dtype: object
Event: 10000
2007    00:26:17.530000
2008    00:26:20.310000
2009    00:26:22.750000
Name: Time, dtype: object

For the road running events, the time format is the same as already defined in time_format_long above. For the track events, most of the times have the same format ('%H:%M:%S.%f'), but there are occassional cases where the milliseconds field is missing. For these cases it's possible to use the infer_datetime_format feature of pandas.to_datetime().

In [95]:
# Convert strings in the Time column to datetime objects.
# Convert to datetime and extract the time part only.
events = top_running['Road'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[
        top_running['Road'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Road'] == event]['Time'],
        format=time_format_long).apply(datetime.time)
In [96]:
# Convert strings in the Time column to datetime objects.
# Convert to datetime and extract the time part only.
events = top_running['Track_Flat'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[
        top_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Track_Flat'] == event]['Time'],
        infer_datetime_format=True).apply(datetime.time)
Name

Looking at the names in this data set - they seem straightforward:

In [97]:
top_running['Name'].head()
Out[97]:
0      David Rudisha
1      David Rudisha
2      David Rudisha
3    Wilson Kipketer
4    Wilson Kipketer
Name: Name, dtype: object

Other name text processing is the same as the previous section

In [98]:
# Use the processing function defined previously
process_names(top_running)
Gender

To be consistent with the other data sets, change the possible values of the 'Gender' feature to be either 'M' or 'F' instead of 'Men' or 'Women'.

In [99]:
top_running['Gender'] = ['M' if top_running['Gender'].iloc[i] == 'Men'
                         else 'F' for i in top_running.index]
In [100]:
top_running.head()
Out[100]:
Rank Time Name RawName Country Date of Birth Place City Year Date Gender Road Track_Flat
0 1 00:01:40.910000 david rudisha David Rudisha KEN 1988-12-17 1.0 London 2012 2012-09-08 M 0 800
1 2 00:01:41.010000 david rudisha David Rudisha KEN 1988-12-17 1.0 Rieti 2010 2010-08-29 M 0 800
2 3 00:01:41.090000 david rudisha David Rudisha KEN 1988-12-17 1.0 Berlin 2010 2010-08-22 M 0 800
3 4 00:01:41.110000 wilson kipketer Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997 1997-08-24 M 0 800
4 5 00:01:41.240000 wilson kipketer Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997 1997-08-13 M 0 800
Olympic Label

Later in this analysis, times from this data set will be merged into the Olympic data set. To facilitate this, it is necessary to label which rows correspond to an Olympic Games. This will be done by comparing the 'Date' field of the result to the known dates of the Olympic Games.

In [101]:
# Convert dates to datetime format
top_running['Date'] = pd.to_datetime(
    top_running['Date'], infer_datetime_format=True)
In [102]:
# What's the earliest year in the top_running data set?
min(top_running['Year'].tolist())
Out[102]:
1962

So there is no need to look at years before 1962.

In [103]:
# List of dates of Olympic summer games
# Source: https://en.wikipedia.org/wiki/Summer_Olympic_Games
# Use format Year-month-day
olympic_dates = [
                ['1964-10-10', '1964-10-24'],
                ['1968-10-12', '1968-10-27'],
                ['1972-08-26', '1972-09-10'],
                ['1976-07-17', '1976-08-01'],
                ['1980-07-19', '1980-08-03'],
                ['1984-07-28', '1984-08-12'],
                ['1988-09-17', '1988-10-02'],
                ['1992-07-25', '1992-08-09'],
                ['1996-07-19', '1996-08-04'],
                ['2000-09-15', '2000-10-01'],
                ['2004-08-13', '2004-08-29'],
                ['2008-08-08', '2008-08-24'],
                ['2012-07-27', '2012-08-12'],
                ['2016-08-05', '2016-08-21']
                ]

olympic_dates_df = pd.DataFrame(olympic_dates, columns=['Start', 'End'],
                                index=[1964,
                                       1968,
                                       1972,
                                       1976,
                                       1980,
                                       1984,
                                       1988,
                                       1992,
                                       1996,
                                       2000,
                                       2004,
                                       2008,
                                       2012,
                                       2016])
olympic_dates_df['Start'] = pd.to_datetime(
    olympic_dates_df['Start'], format='%Y-%m-%d')
olympic_dates_df['End'] = pd.to_datetime(
    olympic_dates_df['End'], format='%Y-%m-%d')
In [104]:
top_running.insert(loc=top_running.columns.get_loc('Date'), 
                   column='Olympics', value=False)
In [105]:
for y in olympic_dates_df.index:
    top_running.loc[
        top_running['Year'] == y, 'Olympics'] = (top_running['Year'] == y) & (
        top_running['Date'] >= olympic_dates_df.loc[y, 'Start']) & (
        top_running['Date'] <= olympic_dates_df.loc[y, 'End'])
In [106]:
top_running[top_running['Olympics']== True].head()
Out[106]:
Rank Time Name RawName Country Date of Birth Place City Year Olympics Date Gender Road Track_Flat
17 18 00:01:42.150000 david rudisha David Rudisha KEN 1988-12-17 1.0 Rio de Janeiro 2016 True 2016-08-15 M 0 800
41 42 00:01:42.580000 vebjørn rodal Vebjørn Rodal NOR 1972-09-16 1.0 Atlanta 1996 True 1996-07-31 M 0 800
47 45 00:01:42.610000 taoufik makhloufi Taoufik Makhloufi ALG 1988-04-29 2.0 Rio de Janeiro 2016 True 2016-08-15 M 0 800
55 56 00:01:42.740000 hezekiél sepeng Hezekiél Sepeng RSA 1974-06-30 2.0 Atlanta 1996 True 1996-07-31 M 0 800
62 63 00:01:42.790000 frederick onyancha Frederick Onyancha KEN 1969-12-25 3.0 Atlanta 1996 True 1996-07-31 M 0 800
Top 10 Label

It is useful to label the top 10 performances in each event, for each gender, and for every year. This is because the data set includes the top 1000 performances for all events, and since the main concern for this analysis is the factors affecting the improvement of performances, it is worth identifying the top 10 performances in each year.

In [107]:
top_running.insert(loc=top_running.columns.get_loc('Time'),
                   column='Top 10', value=False)

event_categories = ['Track_Flat', 'Road']

for gender in top_running['Gender'].unique().tolist():
    for category in event_categories:
        events = top_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        for event in events:
            for year in top_running[
                    top_running[category] == event]['Year'].unique().tolist():
                best_times_per_year = top_running[
                    (top_running[category] == event) &
                    (top_running['Year'] == year) &
                    (top_running['Gender'] == gender)]['Time'].tolist()
                if len(best_times_per_year) > 0:
                    if len(best_times_per_year) >= 10:
                        cutoff = sorted(best_times_per_year)[9]
                    else:
                        cutoff = sorted(
                            best_times_per_year)[len(best_times_per_year)-1]
                    top_running.loc[
                        (top_running[category] == event) &
                        (top_running['Year'] == year) &
                        (top_running['Gender'] == gender),
                        'Top 10'] = top_running.loc[
                        (top_running[category] == event) &
                        (top_running['Year'] == year) &
                        (top_running['Gender'] == gender)]['Time'] <= cutoff
                else:
                    continue

Perform a check that this has worked:

In [108]:
top_running[(top_running['Road'] == 42195) & (
    top_running['Top 10'] == True) & (
    top_running['Year'] == 2012)].head()
Out[108]:
Rank Top 10 Time Name RawName Country Date of Birth Place City Year Olympics Date Gender Road Track_Flat
3020 20 True 02:04:15 geoffrey mutai Geoffrey Mutai KEN 1981-10-07 1.0 Berlin 2012 False 2012-09-30 M 42195 0
3021 21 True 02:04:16 dennis kimetto Dennis Kimetto KEN 1984-04-22 2.0 Berlin 2012 False 2012-09-30 M 42195 0
3022 22 True 02:04:23 ayele absehero Ayele Absehero ETH 1990-12-28 1.0 Dubai 2012 False 2012-01-27 M 42195 0
3032 32 True 02:04:38 tsegay kebede Tsegay Kebede ETH 1987-01-15 1.0 Chicago 2012 False 2012-10-07 M 42195 0
3035 35 True 02:04:44 wilson kipsang kiprotich Wilson Kipsang Kiprotich KEN 1982-03-15 1.0 London 2012 False 2012-04-22 M 42195 0

This concludes the processing of this data set, and the data frame will be named top_running from this point on.

3.4. Merging Data Sets

The full Olympic data set has information about athlete characteristics but no times or results. Both the track and field data set and the top running times data set have times and results, but no athlete data. So to answer questions about how results and athlete characteristics are related it is necessary to merge these data sets. Athletes often compete in multiple Olympic Games and in different events, so it will be necessary to find a match based on the year, the event, medal awardd and the athlete's name. It will be straightforward to match the year across both data sets, and also the events and medals, because the labels are already standardised. The name presents an additional challenge because it is written differently in each data set for some athletes appearing in both. For example, here is how Mo Farah's performance in the 10000 m in 2016 looks in the Olympic track and field data set:

In [109]:
ol_tf.loc[[1]]
Out[109]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17

Compare the way his name is written to the way it appears for the same performance in the full Olympic data set:

In [110]:
ol_running.loc[[66487]]
Out[110]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
66487 34012 mohamed muktar jama farah Mohamed Muktar Jama "Mo" Farah M 33.0 175.0 58.0 Great Britain GBR 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 10000 G

Row 66487 contains Mo Farah's performance matching the one in the Olympic track and field data, but the name is written very differently. To overcome this, a method called fuzzy matching will be used.

The aim is to merge the time data into the ol_running data frame, where it is available.

In [111]:
ol_running.head()
Out[111]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
26 8 cornelia aalten Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 0 0 0 100 NaN
98 34 jamale aarrass Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 0 0 0 1500 NaN
148 55 antonio abadia beci Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN
190 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 0 0 0 1500 NaN
191 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 0 0 0 1500 B

Add two columns to ol_running, one for the time and one for the merged-in name, which can be used as a sanity check for the data merging process.

In [112]:
ol_running.insert(loc=len(ol_running.columns), column='Time', value=pd.NaT)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'),
                  column='Merged_name', value=np.NaN)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'),
                  column='Ratio', value=np.NaN)

Now define a function to merge the times from the ol_tf_running data set into the ol_running data set. This function splits the results in each data set into groups by event, year, gender and medal awarded. This cuts the full results set into much smaller and more manageable groups. Every pass of the loop examines a pair of corresponding groups, one from each data set. Each group of results is for the same set of event, year, gender and medal awarded. The function then compares the names in each. If the strings don't match in a simple way (using str.find(), then it applies the fuzzy matching algorithm to find the best match (process.extractOne()). If the match ratio between the two strings being compared is above a threshold (chosen arbitrarily as 50) then use the two rows being compared as a match, and save the time, name and ratio in the ol_running dataset.

In [113]:
def merge_times(df, event_categories, debug=False):
    """
    Helper function to merge times from one dataset into the
    ol_running dataset. Event, year, gender, medal and athlete name
    are used as inputs to match athlete data from
    one data frame to the same athlete's performance
    in the other data frame.
    Names are matched using fuzzy string matching.

    Input parameters:
    df               - data frame to merge
    event_categories - list of categories of events
    debug            - True/False flag switching debug on/off.

    Returns:
    None
    """
    for category in event_categories:
        if debug:
            print(category)
        ol_running_groups = ol_running.groupby(
            [category, 'Year', 'Gender', 'Medal'])
        df_groups = df.groupby([category, 'Year', 'Gender', 'Medal'])
        events = ol_running[category].unique().tolist()
        events.remove(0)
        for event in events:
            if debug:
                print(event)
            for gender in ol_running['Gender'].unique().tolist():
                if debug:
                    print(gender)
                for year in ol_running['Year'].unique().tolist():
                    if debug:
                        print(year)
                    for medal in ol_running['Medal'].unique().tolist():
                        if debug:
                            print(medal)
                        try:
                            group_1 = ol_running_groups.get_group(
                                (event, year, gender, medal))
                        except KeyError:
                            if debug:
                                print("No results in ol_running_groups")
                            continue
                        try:
                            group_2 = df_groups.get_group(
                                (event, year, gender, medal))
                        except KeyError:
                            if debug:
                                print("No results in df_groups")
                            continue
                        name_options = group_1['Name'].tolist()
                        for name in group_2['Name']:
                            find_result = group_1['Name'].str.find(name)
                            i = find_result[find_result > -1].index
                            if debug:
                                print(i)
                            if(i.any()):
                                if debug:
                                    print("str.find found a match: {}"
                                          .format(name))
                                # Don't replace a time if one exists
                                if pd.isnull(
                                        ol_running.loc[i]['Time'].tolist()):
                                    ol_running.loc[i, 'Time'] = group_2.loc[
                                        group_2['Name'] == name][
                                                            'Time'].tolist()
                                    ol_running.loc[i, 'Merged_name'] = name
                            else:
                                if debug:
                                    print("str.find did NOT find a match:")
                                best_match = process.extractOne(
                                    name, name_options)
                                if debug:
                                    print(best_match)
                                    print("Best name: {}"
                                          .format(best_match[0]))
                                    print("Match confidence: {}"
                                          .format(best_match[1]))
                                    print("index={}"
                                          .format(group_1[
                                              group_1['Name'] == best_match[
                                                                  0]].index))
                                if best_match[1] > 50:
                                    i = group_1[
                                        group_1['Name'] == best_match[0]].index
                                    # Don't replace a time if one exists
                                    if pd.isnull(
                                            ol_running.loc[i][
                                                            'Time'].tolist()):
                                        ol_running.loc[i,
                                                       'Merged_name'] = name
                                        ol_running.loc[i,
                                                       'Ratio'] = best_match[1]
                                        ol_running.loc[i,
                                                       'Time'] = group_2.loc[
                                            group_2['Name'] == name][
                                                              'Time'].tolist()
In [114]:
event_categories = ['Track_Flat', 'Hurdles', 'Road', 'Steeplechase']
merge_times(ol_tf_running, event_categories)

Check how well the fuzzy matching algorithm is doing:

In [115]:
ol_running.loc[~ol_running['Time'].isnull()].head()
Out[115]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000
655 379 addis abebe addis abebe NaN Addis Abebe M 21.0 160.0 50.0 Ethiopia ... 1992 Summer Barcelona Athletics 0 0 0 10000 B 00:28:00.070000
720 411 gezahgne abera gezahegne abera 97.0 Gezahgne Abera M 22.0 166.0 58.0 Ethiopia ... 2000 Summer Sydney Athletics 0 42195 0 0 G 02:10:11
747 428 elvan abeylegesse elvan abeylegesse NaN Elvan Abeylegesse F 25.0 159.0 40.0 Turkey ... 2008 Summer Beijing Athletics 0 0 0 5000 S 00:15:42.740000
915 519 harold maurice abrahams harold abrahams 86.0 Harold Maurice Abrahams M 24.0 183.0 75.0 Great Britain ... 1924 Summer Paris Athletics 0 0 0 100 G 00:00:10.600000

5 rows × 22 columns

From a visual scan of the three columns corresponding to athlete name, it looks like the fuzzy matching algorithm is doing a good job of finding the correct names. The matching algorithm uses a threshold value of 50 for the match ratio. As a further check, examine the matches with the lowest match ratio:

In [116]:
ol_running[ol_running['Ratio'] < 70]
Out[116]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
83230 42280 lorraine graham lorraine fenton 66.0 Lorraine Graham (-Fenton) F 27.0 174.0 59.0 Jamaica ... 2000 Summer Sydney Athletics 0 0 0 400 S 00:00:49.580000
126564 63825 katharina anna krau käthe krauss 52.0 Katharina Anna "Kthe" Krau F 29.0 176.0 72.0 Germany ... 1936 Summer Berlin Athletics 0 0 0 100 B 00:00:11.900000
169845 85373 yuliya viktorovna nesterenko yuliya nestsiarenka 65.0 Yuliya Viktorovna Nesterenko (Bartsevich-) F 25.0 176.0 62.0 Belarus ... 2004 Summer Athina Athletics 0 0 0 100 G 00:00:10.930000
209437 105164 flix snchez marcelo felix sanchez 69.0 Flix Snchez Marcelo M 26.0 178.0 64.0 Dominican Republic ... 2004 Summer Athina Athletics 400 0 0 0 G 00:00:47.630000
209439 105164 flix snchez marcelo felix sanchez 69.0 Flix Snchez Marcelo M 34.0 178.0 64.0 Dominican Republic ... 2012 Summer London Athletics 400 0 0 0 G 00:00:47.630000
225056 113067 son gi-jeong kitei son 54.0 Son Gi-Jeong M 21.0 170.0 60.0 Japan ... 1936 Summer Berlin Athletics 0 42195 0 0 G 02:29:19
260600 130418 erik wilhelm wiln erkka wilen 66.0 Erik Wilhelm "Erkka" Wiln M 25.0 180.0 73.0 Finland ... 1924 Summer Paris Athletics 400 0 0 0 S 00:00:53.800000

7 rows × 22 columns

So there are only seven values with a match ratio below 70, and they all look correct. So the matching algorithm seems to be working well.

In [117]:
ol_running.loc[~ol_running['Time'].isnull()].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1177 entries, 191 to 270000
Data columns (total 22 columns):
ID              1177 non-null int64
Name            1177 non-null object
Merged_name     1177 non-null object
Ratio           815 non-null float64
RawName         1177 non-null object
Gender          1177 non-null object
Age             1171 non-null float64
Height          1123 non-null float64
Weight          1121 non-null float64
Team            1177 non-null object
NOC             1177 non-null object
Games           1177 non-null object
Year            1177 non-null int16
Season          1177 non-null object
City            1177 non-null object
Sport           1177 non-null object
Hurdles         1177 non-null int16
Road            1177 non-null int32
Steeplechase    1177 non-null int16
Track_Flat      1177 non-null int16
Medal           1177 non-null object
Time            1177 non-null object
dtypes: float64(4), int16(4), int32(1), int64(1), object(12)
memory usage: 179.3+ KB

So this method has merged in 1177 time data fields. Next, merge in times from the top_running data frame. This is a little more complicated because it is necessary to group on events marked as True in the 'Olympics' feature of this data frame to screen out other performances by the same athlete in the same year. In addition, some athletes may run several heats and a final in a single Games. Therefore, it is necessary to use the time from the race with the latest date, and within the period of the Games in question. If there turn out to be more than one (e.g., if a final and a heat were run on the same day) then we choose one arbitrarily. This is not a huge problem, since we are attempting to relate performances to height, weight and age, and those factors will not change within one day anyway.

The ol_tf_running data set contained only medal-winning performances, so it was (almost) guaranteed that there would be a corresponding row in the ol_running data set. The top_running data set differs from the ol_tf_running data set in that it contains many non-medal winning performances. Therefore, it's not possible to use the 'Medal' field to group the performances and use that to help match them. This means there is a larger scope for false positives, where the fuzzy matching algorithm wrongly identifies two similar names as a match. To help solve this, the match ratio threshold is raised from 50 to 80 in this function. It's not straightforward to combine this extra complexity into the existing merge_times function, so write a new function to handle this.

In [118]:
def merge_times_ext(df, event_categories, debug=False):
    """
    Helper function to merge times from one dataset into
    the ol_running dataset.
    Event, year, gender, and athlete name are used as inputs
    to match athlete data from one data frame to the same
    athlete's performance in the other data frame.
    Names are matched using fuzzy string matching.

    Input parameters:
    df               - data frame to merge
    event_categories - list of categories of events
    debug            - True/False flag to switch debug on/off.

    Returns:
    None
    """
    for category in event_categories:
        if debug:
            print(category)
        ol_running_groups = ol_running.groupby([category, 'Year', 'Gender'])
        df_groups = df.groupby([category, 'Year', 'Gender', 'Olympics'])
        events = ol_running[category].unique().tolist()
        events.remove(0)
        for event in events:
            if debug:
                print(event)
            for gender in ol_running['Gender'].unique().tolist():
                if debug:
                    print(gender)
                for year in ol_running['Year'].unique().tolist():
                    if debug:
                        print(year)
                    try:
                        group_1 = ol_running_groups.get_group(
                            (event, year, gender))
                    except KeyError:
                        if debug:
                            print("No results in ol_running_groups")
                        continue
                    try:
                        group_2 = df_groups.get_group(
                            (event, year, gender, True))
                    except KeyError:
                        if debug:
                            print("No results in df_groups")
                        continue
                    name_options = group_1['Name'].tolist()
                    for name in group_2['Name']:
                        find_result = group_1['Name'].str.find(name)
                        i = find_result[find_result > -1].index
                        if debug:
                            print(i)
                        if(i.any()):
                            if debug:
                                print("str.find found a match: {}"
                                      .format(name))
                            # Don't replace a time if one exists
                            if pd.isnull(ol_running.loc[i]['Time'].tolist()):
                                latest_race_date = group_2.loc[
                                    (group_2['Name'] == name, 'Date')].max()
                                ol_running.loc[i, 'Time'] = group_2.loc[
                                    (group_2['Name'] == name) &
                                    (group_2['Date'] == latest_race_date)][
                                                        'Time'].tolist()[0]
                        else:
                            if debug:
                                print("str.find did NOT find a match:")
                            best_match = process.extractOne(name, name_options)
                            if debug:
                                print(best_match)
                                print("Best name: {}"
                                      .format(best_match[0]))
                                print("Match confidence: {}"
                                      .format(best_match[1]))
                                print("index={}"
                                      .format(group_1[group_1[
                                          'Name'] == best_match[0]].index))
                            if best_match[1] > 80:
                                i = group_1[group_1[
                                    'Name'] == best_match[0]].index
                                # Don't replace a time if one exists
                                if pd.isnull(
                                        ol_running.loc[i]['Time'].tolist()):
                                    ol_running.loc[i, 'Merged_name'] = name
                                    ol_running.loc[i, 'Ratio'] = best_match[1]
                                    latest_race_date = group_2.loc[(group_2[
                                        'Name'] == name, 'Date')].max()
                                    ol_running.loc[i, 'Time'] = group_2.loc[
                                        (group_2['Name'] == name) &
                                        (group_2['Date'] == latest_race_date)][
                                                            'Time'].tolist()[0]
In [119]:
event_categories = ['Track_Flat', 'Road']
merge_times_ext(top_running, event_categories)
In [120]:
ol_running.loc[~ol_running['Time'].isnull()].head()
Out[120]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000
559 321 ahmad hassan abdullah abdullah ahmad hassan 95.0 Ahmad Hassan Abdullah M 27.0 164.0 55.0 Qatar ... 2008 Summer Beijing Athletics 0 0 0 10000 NaN 00:27:23.750000
655 379 addis abebe addis abebe NaN Addis Abebe M 21.0 160.0 50.0 Ethiopia ... 1992 Summer Barcelona Athletics 0 0 0 10000 B 00:28:00.070000
720 411 gezahgne abera gezahegne abera 97.0 Gezahgne Abera M 22.0 166.0 58.0 Ethiopia ... 2000 Summer Sydney Athletics 0 42195 0 0 G 02:10:11
745 428 elvan abeylegesse NaN NaN Elvan Abeylegesse F 21.0 159.0 40.0 Turkey ... 2004 Summer Athina Athletics 0 0 0 1500 NaN 00:04:00.670000

5 rows × 22 columns

In [121]:
ol_running.loc[~ol_running['Time'].isnull()].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1547 entries, 191 to 270808
Data columns (total 22 columns):
ID              1547 non-null int64
Name            1547 non-null object
Merged_name     1421 non-null object
Ratio           1059 non-null float64
RawName         1547 non-null object
Gender          1547 non-null object
Age             1541 non-null float64
Height          1489 non-null float64
Weight          1488 non-null float64
Team            1547 non-null object
NOC             1547 non-null object
Games           1547 non-null object
Year            1547 non-null int16
Season          1547 non-null object
City            1547 non-null object
Sport           1547 non-null object
Hurdles         1547 non-null int16
Road            1547 non-null int32
Steeplechase    1547 non-null int16
Track_Flat      1547 non-null int16
Medal           1208 non-null object
Time            1547 non-null object
dtypes: float64(4), int16(4), int32(1), int64(1), object(12)
memory usage: 235.7+ KB

Merging the second data set in has increased the number of rows with a time by a few hundred.

4. Data Analysis and Visualisation

4.1. Question 1: How have athletes' performances changed through history?

To investigate this plot athletes' results (i.e. times) against the date of the performance. This will be done individually for each event and separately for each gender. The plots use colour to identify Olympic medal winning performances (see the key). Both the ol_tf_running and top_running datasets are used for this. The analysis will follow in section 5.1.

In [122]:
# Group by gender
top_running_gender_groups = top_running.groupby(['Gender', 'Top 10'])
top_running_m = top_running_gender_groups.get_group(('M', True))
top_running_f = top_running_gender_groups.get_group(('F', True))
ol_tf_running_gender_groups = ol_tf_running.groupby('Gender')
ol_tf_running_m = ol_tf_running_gender_groups.get_group('M')
ol_tf_running_f = ol_tf_running_gender_groups.get_group('F')
In [123]:
def build_graph_labels(gender, category, event, characteristic=None):
    """
    build_graph_labels

    Helper function to create strings to use in constructing the graph title

    Input parameters:
    gender - athlete gender group
    category - type of event
    event - specific distance
    characteristic - athlete characteristic, default None

    Returns:
    gender_label - Readable gender string
    event_label - Readable event name string
    category_label - Readable event category string
    unit - Unit for the characteristic to plot

    """
    if gender == 'M':
        gender_label = "Male"
    else:
        gender_label = "Female"
    if category == 'Road':
        if event == 42195:
            event_label = 'Marathon'
        if event == 21098:
            event_label = 'Half Marathon'
        category_label = 'Road Running'
    if category == 'Track_Flat':
        event_label = str(event)+'m'
        category_label = 'Track (Flat)'
    if category == 'Hurdles':
        event_label = str(event)+'m'+' Hurdles'
        category_label = 'Hurdles'
    if category == 'Steeplechase':
        event_label = str(event)+'m'+' Steeplechase'
        category_label = 'Steeplechase'
    if characteristic == 'Height':
        unit = 'cm'
    elif characteristic == 'Weight':
        unit = 'kg'
    elif characteristic == 'Age':
        unit = 'years'
    elif characteristic == 'BMI':
        unit = 'm/kg*kg'
    else:
        unit = None

    return gender_label, event_label, category_label, unit
In [124]:
def update_min_max(x_series, y_series, x_min, x_max, y_min, y_max):
    """
    Helper to update minimum and maximum values of the x and y series
    Input parameters:
    x_series - A datetime.date series
    y_series - A datetime.time list
    x_min    - earliest date found so far
    x_max    - latest date found so far
    y_min    - shortest time found so far
    y_max    - longest time found so far

    Returns:
    x_min - Earliest date
    x_max - Latest date
    y_min - Smallest time
    y_max - Largest time

    """
    if x_series.min() < x_min:
        x_min = x_series.min()
    if x_series.max() > x_max:
        x_max = x_series.max()
    if min(y_series) < y_min:
        y_min = min(y_series)
    if max(y_series) > y_max:
        y_max = max(y_series)

    return x_min, x_max, y_min, y_max
In [125]:
def plot_times(event_categories, debug=False):
    """
    Helper function to plot finish times for athletes across all events.

    Input parameters:
    event_categories - list of categories of events
    debug            - True/False flag switching debug on/off.

    Returns:
    None
    """
    global graph_number
    summary_strings = []
    x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
    x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
    y_min = time(23, 0, 0)
    y_max = time(0, 0, 0)
    top_running_data_present = True
    ol_tf_running_data_present = True
    for category in event_categories:
        events = ol_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        if category == 'Road':
            events.append(21098)
        for event in events:
            if debug:
                print("Category = {}, Event={}".format(category, event))
            for gender in ol_tf_running_gender_groups.groups.keys():
                if gender == 'M':
                    top_running_group = top_running_m
                    ol_tf_running_group = ol_tf_running_m
                else:
                    top_running_group = top_running_f
                    ol_tf_running_group = ol_tf_running_f
                plt.figure(figsize=(18, 9))
                # Plot top running time data, if it exists
                try:
                    x_series = top_running_group[
                        top_running_group[category] == event]['Date']
                    y_series = list(top_running_group[
                        top_running_group[category] == event]['Time'])
                    plt.scatter(x_series, y_series, color='b',
                                label='Top 10 results in year')

                    x_min, x_max, y_min, y_max = update_min_max(
                                                        x_series,
                                                        y_series,
                                                        x_min,
                                                        x_max,
                                                        y_min,
                                                        y_max)
                except KeyError:
                    if debug:
                        print("No data from top running times for this event.")
                    top_running_data_present = False
                # Plot each olympic medal colour, if data exists for this event
                if ol_tf_running_group[(ol_tf_running_group[
                                    category] == event)].shape[0] != 0:
                    x_series = pd.to_datetime(
                        ol_tf_running_group
                        [(ol_tf_running_group[category] == event) &
                         (ol_tf_running_group['Medal'] == 'G')]['Year'],
                        format='%Y')
                    y_series = list(ol_tf_running_group[
                        (ol_tf_running_group[category] == event) &
                        (ol_tf_running_group['Medal'] == 'G')]['Time'])
                    plt.scatter(x_series, y_series, color='gold',
                                label='Olympic gold medal')
                    x_min, x_max, y_min, y_max = update_min_max(
                                                                x_series,
                                                                y_series,
                                                                x_min,
                                                                x_max,
                                                                y_min,
                                                                y_max)
                    x_series = pd.to_datetime(ol_tf_running_group[
                        (ol_tf_running_group[category] == event) &
                        (ol_tf_running_group['Medal'] == 'S')]['Year'],
                        format='%Y')
                    y_series = list(ol_tf_running_group[
                        (ol_tf_running_group[category] == event) &
                        (ol_tf_running_group['Medal'] == 'S')]['Time'])
                    plt.scatter(x_series, y_series, color='silver',
                                label='Olympic silver medal')
                    x_min, x_max, y_min, y_max = update_min_max(
                                                                x_series,
                                                                y_series,
                                                                x_min,
                                                                x_max,
                                                                y_min,
                                                                y_max)
                    x_series = pd.to_datetime(ol_tf_running_group[
                        (ol_tf_running_group[category] == event) &
                        (ol_tf_running_group['Medal'] == 'B')]['Year'],
                        format='%Y')
                    y_series = list(ol_tf_running_group[
                        (ol_tf_running_group[category] == event) &
                        (ol_tf_running_group['Medal'] == 'B')]['Time'])

                    plt.scatter(x_series, y_series, color='brown',
                                label='Olympic bronze medal')
                    x_min, x_max, y_min, y_max = update_min_max(
                                                                x_series,
                                                                y_series,
                                                                x_min,
                                                                x_max,
                                                                y_min,
                                                                y_max)
                else:
                    ol_tf_running_data_present = False
                    if debug:
                        print("No data from Olympic data set for this event.")
                # Only plot if there is some data
                if ((ol_tf_running_data_present is True) or
                        (top_running_data_present is True)):
                    # Construct the graph title and axis labels
                    (gender_label,
                     event_label,
                     category_label,
                     unit) = build_graph_labels(
                                                gender,
                                                category,
                                                event)
                    plt.xlabel('Year')
                    plt.ylabel('Time')
                    plt.title("Graph {0}: Variation in {1} {2} Times"
                              .format(graph_number, gender_label, event_label))
                    plt.legend()
                    plt.show()
                    graph_number += 1

                    # Print information about variation in results in graph
                    # Calculate difference between quickest and slowest times
                    y_delta = datetime.combine(
                        date.today(), y_max) - datetime.combine(
                        date.today(), y_min)

                    # Calculate difference as a proportion of the slowest time
                    y_proportion = y_delta.total_seconds() / (
                        (datetime.combine(date.today(), y_min) -
                         datetime.combine(date.today(),
                         time(0, 0, 0))).total_seconds())
                    # Calculate length of time we have plotted data over
                    x_delta = x_max - x_min

                    summary_strings.append(
                        "Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history"
                        .format(
                                gender_label,
                                event_label,
                                y_delta.total_seconds(),
                                y_proportion*100,
                                x_delta.days/365))
                    print("Top times in {0} {1} span a range of {2} seconds ({3:.2f}%) in {4:.1f} years of history"
                          .format(
                                    gender_label,
                                    event_label,
                                    y_delta.total_seconds(),
                                    y_proportion*100,
                                    x_delta.days/365))
                # Reset
                top_running_data_present = True
                ol_tf_running_data_present = True
                x_min = datetime(2020, 1, 1, 0, 0, 0, 0)
                x_max = datetime(1896, 1, 1, 0, 0, 0, 0)
                y_min = time(23, 0, 0)
                y_max = time(0, 0, 0)

    print("Summary")
    for summary in summary_strings:
        print(summary)
In [126]:
# A label for the graphs plotted
graph_number = 1

# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_times(event_categories)
Top times in Female 100m span a range of 1.71 seconds (16.30%) in 89.4 years of history
Top times in Male 100m span a range of 3.02 seconds (31.52%) in 121.4 years of history
Top times in Female 200m span a range of 3.86 seconds (18.09%) in 69.4 years of history
Top times in Male 200m span a range of 3.71 seconds (19.33%) in 117.4 years of history
Top times in Female 400m span a range of 5.8 seconds (12.18%) in 53.6 years of history
Top times in Male 400m span a range of 12.57 seconds (29.21%) in 121.4 years of history
Top times in Female 800m span a range of 23.52 seconds (20.76%) in 89.4 years of history
Top times in Male 800m span a range of 47.09 seconds (46.67%) in 121.4 years of history
Top times in Female 1500m span a range of 20.67 seconds (8.98%) in 45.4 years of history
Top times in Male 1500m span a range of 70.0 seconds (33.98%) in 121.4 years of history
Top times in Female 5000m span a range of 92.97 seconds (10.92%) in 32.9 years of history
Top times in Male 5000m span a range of 155.65 seconds (20.55%) in 104.6 years of history
Top times in Female 10000m span a range of 152.01 seconds (8.65%) in 34.8 years of history
Top times in Male 10000m span a range of 436.63 seconds (29.01%) in 105.4 years of history
Top times in Female 3000m Steeplechase span a range of 11.07 seconds (2.05%) in 8.0 years of history
Top times in Male 3000m Steeplechase span a range of 162.92 seconds (33.71%) in 96.1 years of history
Top times in Female 100m Hurdles span a range of 0.71 seconds (5.75%) in 44.0 years of history
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
Top times in Male 110m Hurdles span a range of 4.79 seconds (37.10%) in 120.1 years of history
Top times in Female 400m Hurdles span a range of 2.77 seconds (5.26%) in 32.0 years of history
Top times in Male 400m Hurdles span a range of 11.52 seconds (24.63%) in 116.1 years of history
Top times in Female Marathon span a range of 1114.0 seconds (13.71%) in 36.5 years of history
Top times in Male Marathon span a range of 6276.0 seconds (85.08%) in 121.4 years of history
Top times in Female Half Marathon span a range of 287.0 seconds (7.37%) in 34.5 years of history
Top times in Male Half Marathon span a range of 148.0 seconds (4.22%) in 30.8 years of history
Summary
Top times in Female 100m span a range of 1.71 seconds (16.30%) in 89.4 years of history
Top times in Male 100m span a range of 3.02 seconds (31.52%) in 121.4 years of history
Top times in Female 200m span a range of 3.86 seconds (18.09%) in 69.4 years of history
Top times in Male 200m span a range of 3.71 seconds (19.33%) in 117.4 years of history
Top times in Female 400m span a range of 5.8 seconds (12.18%) in 53.6 years of history
Top times in Male 400m span a range of 12.57 seconds (29.21%) in 121.4 years of history
Top times in Female 800m span a range of 23.52 seconds (20.76%) in 89.4 years of history
Top times in Male 800m span a range of 47.09 seconds (46.67%) in 121.4 years of history
Top times in Female 1500m span a range of 20.67 seconds (8.98%) in 45.4 years of history
Top times in Male 1500m span a range of 70.0 seconds (33.98%) in 121.4 years of history
Top times in Female 5000m span a range of 92.97 seconds (10.92%) in 32.9 years of history
Top times in Male 5000m span a range of 155.65 seconds (20.55%) in 104.6 years of history
Top times in Female 10000m span a range of 152.01 seconds (8.65%) in 34.8 years of history
Top times in Male 10000m span a range of 436.63 seconds (29.01%) in 105.4 years of history
Top times in Female 3000m Steeplechase span a range of 11.07 seconds (2.05%) in 8.0 years of history
Top times in Male 3000m Steeplechase span a range of 162.92 seconds (33.71%) in 96.1 years of history
Top times in Female 100m Hurdles span a range of 0.71 seconds (5.75%) in 44.0 years of history
Top times in Male 110m Hurdles span a range of 4.79 seconds (37.10%) in 120.1 years of history
Top times in Female 400m Hurdles span a range of 2.77 seconds (5.26%) in 32.0 years of history
Top times in Male 400m Hurdles span a range of 11.52 seconds (24.63%) in 116.1 years of history
Top times in Female Marathon span a range of 1114.0 seconds (13.71%) in 36.5 years of history
Top times in Male Marathon span a range of 6276.0 seconds (85.08%) in 121.4 years of history
Top times in Female Half Marathon span a range of 287.0 seconds (7.37%) in 34.5 years of history
Top times in Male Half Marathon span a range of 148.0 seconds (4.22%) in 30.8 years of history

The analysis for the question "How have athletes' performances changed through history?" can be found in section 5.1.

4.2. Question 2: How have athletes' characteristics changed through history?

This will be examined by plotting the mean value of each of the four athlete characteristics (height, weight, age, BMI) against year of competition. Multiple events are plotted on the same axis for ease of comparison. The Numpy polyfit() method is used to plot a best fit line for each event.

The analysis can be found in section 5.2

In [127]:
def plot_athlete_characteristics(event_categories,
                                 characteristics,
                                 debug=False):
    """
    Helper function to plot athlete characteristics
    (height, weight, age, BMI) against year of competition.
    Multiple events are plotted on the same axis for comparison.
    A best fit line is added for each event in the plot.

    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to switch debug on/off.

    Returns:
    None
    """
    meanvals = []
    years = []
    global graph_number
    for c in characteristics:
        for category in event_categories:
            if debug:
                print(category)
            ol_running_groups = ol_running.groupby([category, 'Gender'])
            events = ol_running[category].unique().tolist()
            events.remove(0)
            for gender in ol_running['Gender'].unique().tolist():
                plt.figure(figsize=(18, 9))
                if debug:
                    print(gender)
                for event in events:
                    if debug:
                        print(event)
                    try:
                        ol_running_group = ol_running_groups.get_group(
                            (event, gender))
                    except KeyError:
                        if debug:
                            print("No results in ol_running_groups")
                        continue
                    for year in ol_running_group['Year'].unique().tolist():
                        meanvals.append(ol_running_group[
                            ol_running_group['Year'] == year][c].mean())
                        years.append(year)
                    plt.scatter(years, meanvals, label=str(event)+'m')
                    # Remove NaN values for polyfit()
                    nullvals = np.isnan(meanvals)
                    for i in np.where(nullvals)[0]:
                        meanvals.pop(i)
                        years.pop(i)
                    z = np.polyfit(years, meanvals, 1)
                    p = np.poly1d(z)
                    plb.plot(years, p(years))
                    meanvals = []
                    years = []
                # Construct the graph title and axis labels
                (gender_label,
                 event_label,
                 category_label,
                 unit) = build_graph_labels(gender, category, event, c)
                plt.xlabel('Year')
                plt.ylabel(c+'({})'.format(unit))
                plt.title("Graph {0}: Variation in Mean {1} of {2} {3} Olympic Athletes Through History"
                          .format(
                                    graph_number,
                                    c,
                                    gender_label,
                                    category_label))
                plt.legend()
                plt.show()
                graph_number += 1

Calculate body mass index (BMI):

In [128]:
ol_running.insert(
    loc=ol_running.columns.get_loc('Weight'), column='BMI', value=0)
ol_running['BMI'] = ol_running['Weight'] / ((ol_running['Height'] / 100)**2)
ol_running.head()
Out[128]:
ID Name Merged_name Ratio RawName Gender Age Height BMI Weight ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
26 8 cornelia aalten NaN NaN Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN NaN ... 1932 Summer Los Angeles Athletics 0 0 0 100 NaN None
98 34 jamale aarrass NaN NaN Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 21.733535 76.0 ... 2012 Summer London Athletics 0 0 0 1500 NaN None
148 55 antonio abadia beci NaN NaN Antonio Abadia Beci M 26.0 170.0 22.491349 65.0 ... 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN None
190 86 jos manuel abascal gmez NaN NaN Jos Manuel Abascal Gmez M 22.0 182.0 20.227026 67.0 ... 1980 Summer Moskva Athletics 0 0 0 1500 NaN None
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 20.227026 67.0 ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000

5 rows × 23 columns

In [129]:
characteristics = ['Height', 'Weight', 'Age', 'BMI']
plot_athlete_characteristics(event_categories, characteristics)

The analysis can be found in section 5.2

This will be examined by plotting each of the four athlete characteristics against their performance (i.e., time) for each event and gender. Colours are used to indicate medal-winning performances as indicated by the key.

The analysis can be found in section 5.3

In [130]:
ol_running_gender_groups = ol_running.groupby('Gender')
ol_running_m = ol_running_gender_groups.get_group('M')
ol_running_f = ol_running_gender_groups.get_group('F')
In [131]:
def plot_time_vs_characteristics(event_categories,
                                 characteristics,
                                 debug=False):
    """
    Helper function to plot athlete characteristics
    (height, weight, age, BMI) against time.

    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to switch debug on/off.

    Returns:
    None
    """
    global graph_number
    data_present = True
    for c in characteristics:
        for category in event_categories:
            events = ol_running[category].unique().tolist()
            events.remove(0)
            events.sort()
            for event in events:
                if debug:
                    print("Category = {}, Event={}".format(category, event))
                for gender in ol_running_gender_groups.groups.keys():
                    if gender == 'M':
                        ol_running_group = ol_running_m
                    else:
                        ol_running_group = ol_running_f
                    plt.figure(figsize=(18, 9))
                    # Plot each olympic medal colour, if data exists
                    if ol_running_group[(ol_running_group[
                                    category] == event)].shape[0] != 0:
                        # Plot vertical lines showing mean, std dev
                        meanval = ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Time'].notna())][c].mean()
                        stddev = ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Time'].notna())][c].std()
                        representative_max_time = max(
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'] == 'B') &
                                (ol_running_group['Time'].notna())]['Time']))
                        representative_min_time = min(
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'] == 'B') &
                                (ol_running_group['Time'].notna())]['Time']))
                        plt.plot([meanval, meanval],
                                 [representative_min_time,
                                  representative_max_time],
                                 color='red',
                                 label='Mean value of characteristic')
                        plt.plot([meanval + stddev,
                                  meanval + stddev],
                                 [representative_min_time,
                                  representative_max_time],
                                 color='pink',
                                 label='Mean +/- standard deviation')
                        plt.plot([meanval - stddev,
                                  meanval - stddev],
                                 [representative_min_time,
                                  representative_max_time],
                                 color='pink',
                                 label='Mean +/- standard deviation')

                        plt.scatter(ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Medal'] == 'G') &
                            (ol_running_group['Time'].notna())][c],
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'] == 'G') &
                                (ol_running_group['Time'].notna())]['Time']),
                            color='gold',
                            label='Olympic gold medal')
                        plt.scatter(ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Medal'] == 'S') &
                            (ol_running_group['Time'].notna())][c],
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'] == 'S') &
                                (ol_running_group['Time'].notna())]['Time']),
                            color='silver',
                            label='Olympic silver medal')
                        plt.scatter(ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Medal'] == 'B') &
                            (ol_running_group['Time'].notna())][c],
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'] == 'B') &
                                (ol_running_group['Time'].notna())]['Time']),
                            color='brown',
                            label='Olympic bronze medal')
                        plt.scatter(ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Medal'].isnull()) &
                            (ol_running_group['Time'].notna())][c],
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Medal'].isnull()) &
                                (ol_running_group['Time'].notna())]['Time']),
                            color='blue',
                            label='No medal')
                    else:
                        data_present = False
                        if debug:
                            print("No data from Olympic data set for this event.")
                    # Only plot if there is some data
                    if data_present is True:
                        # Construct the graph title and axis labels
                        (gender_label,
                         event_label,
                         category_label,
                         unit) = build_graph_labels(
                                                    gender,
                                                    category,
                                                    event,
                                                    c)
                        plt.xlabel(c+'({})'.format(unit))
                        plt.ylabel('Time')
                        plt.title("Graph {0}: Variation in {1} {2} Times with {3}"
                                  .format(
                                            graph_number,
                                            gender_label,
                                            event_label,
                                            c))
                        plt.legend()
                        plt.show()
                        graph_number += 1
                    data_present = True
In [132]:
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_time_vs_characteristics(event_categories, characteristics)
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>

Now display the same set of results but colour coded to show the 20-year time period they fall into. This is to look for any relationship between the characterisitc and the results in a particular year group.

In [133]:
# Create a set of year groups throughout Olympic history
year_groups = np.arange(1896, 2020, 20)
year_groups[-1] += 1 # To include 2016 Games
year_groups
Out[133]:
array([1896, 1916, 1936, 1956, 1976, 1996, 2017])
In [134]:
def plot_time_vs_characteristics_time_groups(event_categories,
                                             characteristics,
                                             debug=False):
    """
    Helper function to plot athlete characteristics
    (height, weight, age, BMI) against time.
    This function uses colour codes to show the 20-year
    time period into which a performaance falls.

    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to switch debug on/off.

    Returns:
    None
    """
    global graph_number
    data_present = True
    for c in characteristics:
        meanval_list_m = []
        stddev_list_m = []
        event_list_m = []
        meanval_list_f = []
        stddev_list_f = []
        event_list_f = []
        for category in event_categories:
            events = ol_running[category].unique().tolist()
            events.remove(0)
            events.sort()
            for event in events:
                if debug:
                    print("Category = {}, Event={}".format(category, event))
                for gender in ol_running_gender_groups.groups.keys():
                    if gender == 'M':
                        ol_running_group = ol_running_m
                    else:
                        ol_running_group = ol_running_f
                    plt.figure(figsize=(18, 9))
                    # Plot
                    if ol_running_group[(ol_running_group[
                            category] == event)].shape[0] != 0:
                        for y in range(len(year_groups)-1):
                            plt.scatter(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Year'] >= year_groups[y]) &
                                (ol_running_group['Year'] < year_groups[y+1]) &
                                (ol_running_group['Time'].notna())][c],
                                list(ol_running_group[
                                    (ol_running_group[category] == event) &
                                    (ol_running_group['Year']
                                     >= year_groups[y]) &
                                    (ol_running_group['Year']
                                     < year_groups[y+1]) &
                                    (ol_running_group[
                                        'Time'].notna())]['Time']),
                                label='{0} to {1}'
                                .format(year_groups[y], year_groups[y+1]-1))

                        # Calculate mean value of the characteristic 1996-2016
                        meanval = ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Year'] >= year_groups[y]) &
                            (ol_running_group['Year'] < year_groups[y+1]) &
                            (ol_running_group['Time'].notna())][c].mean()
                        stddev = ol_running_group[
                            (ol_running_group[category] == event) &
                            (ol_running_group['Year'] >= year_groups[y]) &
                            (ol_running_group['Year'] < year_groups[y+1]) &
                            (ol_running_group['Time'].notna())][c].std()
                        if gender == 'M':
                            if category == 'Track_Flat':
                                event_list_m.append(str(event)+'m')
                            if category == 'Hurdles':
                                event_list_m.append(str(event) + 'm Hurdles')
                            if category == 'Steeplechase':
                                event_list_m.append(str(event)
                                                    + 'm Steeplechase')
                            if category == 'Road':
                                event_list_m.append('Marathon')
                            meanval_list_m.append(meanval)
                            stddev_list_m.append(stddev)
                        if gender == 'F':
                            if category == 'Track_Flat':
                                event_list_f.append(str(event)+'m')
                            if category == 'Hurdles':
                                event_list_f.append(str(event) + 'm Hurdles')
                            if category == 'Steeplechase':
                                event_list_f.append(
                                    str(event) + 'm Steeplechase')
                            if category == 'Road':
                                event_list_f.append('Marathon')
                            meanval_list_f.append(meanval)
                            stddev_list_f.append(stddev)
                        representative_max_time = max(
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Year'] >= year_groups[y]) &
                                (ol_running_group['Year'] < year_groups[y+1]) &
                                (ol_running_group['Time'].notna())]['Time']))
                        representative_min_time = min(
                            list(ol_running_group[
                                (ol_running_group[category] == event) &
                                (ol_running_group['Year'] >= year_groups[y]) &
                                (ol_running_group['Year'] < year_groups[y+1]) &
                                (ol_running_group['Time'].notna())]['Time']))
                        plt.plot([meanval, meanval],
                                 [representative_min_time,
                                 representative_max_time],
                                 color='brown',
                                 label='Mean value of characteristic, 1996-2016')
                        plt.plot([meanval+stddev, meanval+stddev],
                                 [representative_min_time,
                                  representative_max_time],
                                 color='brown',
                                 label='Mean value +/- std dev, 1996-2016')
                        plt.plot([meanval-stddev, meanval-stddev],
                                 [representative_min_time,
                                  representative_max_time],
                                 color='brown',
                                 label='Mean value +/- std dev, 1996-2016')
                    else:
                        data_present = False
                        if debug:
                            print("No data from Olympic data set for event.")
                    # Only plot if there is some data
                    if data_present is True:
                        # Construct the graph title and axis labels
                        (gender_label,
                         event_label,
                         category_label,
                         unit) = build_graph_labels(
                                                    gender,
                                                    category,
                                                    event,
                                                    c)
                        plt.xlabel(c+'({})'.format(unit))
                        plt.ylabel('Time')
                        plt.title("Graph {0}: Variation in {1} {2} Times with {3}"
                                  .format(
                                            graph_number,
                                            gender_label,
                                            event_label,
                                            c))
                        plt.legend()
                        plt.show()
                        graph_number += 1
                    data_present = True

        # Plot the relationship between mean values of charcteristic per event
        plt.figure(figsize=(18, 9))
        plt.errorbar(event_list_m, meanval_list_m, yerr=stddev_list_m, fmt='o',
                     label='Mean +/- Standard Deviation')
        # Construct the graph title and axis labels
        gender_label, event_label, category_label, unit = build_graph_labels(
                                                                    'M',
                                                                    category,
                                                                    event,
                                                                    c)
        plt.xlabel('Event')
        plt.ylabel(c + ' ' + unit)
        plt.title("Graph {0}: Mean +/- Standard Deviation {1} of {2} Athletes per Event"
                  .format(graph_number, c, gender_label))
        plt.legend()
        graph_number += 1
        plt.show()

        plt.figure(figsize=(18, 9))
        plt.errorbar(event_list_f, meanval_list_f, yerr=stddev_list_f, fmt='o',
                     label='Mean +/- Standard Deviation')
        # Construct the graph title and axis labels
        gender_label, event_label, category_label, unit = build_graph_labels(
                                                                    'F',
                                                                    category,
                                                                    event,
                                                                    c)
        plt.title("Graph {0}: Mean +/- Standard Deviation {1} for {2} Athletes per Event"
                  .format(graph_number, c, gender_label))
        plt.legend()
        graph_number += 1
        plt.show()
In [135]:
plot_time_vs_characteristics_time_groups(event_categories, characteristics)
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>

The analysis can be found in section 5.3

5. Evaluate Results

5.1. How have athletes' performances changed through history?

This is an analysis for the results in section 4.1

Graphs 1-24 show that top performances have improved for almost every event and gender combination. The one and only exception is women's 3000m steeplechase, but that is because we have very limied data (spanning only 8 years). All other events show an improvement over time.

The biggest proportional improvement is 85% in the men's marathon. But compare this to the improvement in the men's half marathon: a much smaller 4.2%. This illustrates an important point: the men's marathon has 121 years of data, but the half marathon has only the last 31 years of data. The difference in the improvement for the men's marathon and the half marathon is partly because the rate of improvement for every event was much steeper prior to about 1960. Since then, results have improved much more slowly. This can also be seen in the shape of the graphs, which often have a shape like an exponential decay curve.

Many of the men's events have improved in the range of 20-30% throughout the data set we have available. There is no clear relationship between the distance run and the proportional improvement. For example, two of the sprint events (100m and 200m) have improved by 31.5% and 19.3% respectively, and two of the longer track events (5000m and 10000m) have improved by 20.6% and 29.0% respectively. So, a similar range of improvement at two different ends of the track running distances.

The decrease in the rate of improvement since about 1960 also explains why the proportional improvement in women's events is often less than that of the equivalent men's event. Women have been only been allowed to participate at all distances relatively recently, so there is much less data available and it does not stretch back in time as far as for the men's events. However, where the length of history is comparable for each gender, it's clear that women and men are improving at a comparable rate. A good example is the half marathon, which have 34.5 years of history for the women, and 0.8 years for the men. The proportional improvements are similar: 7.37% for women, 4.22% for men.

Many of the women's events have improved in the range of 5-20%. The events with the longest history tend to be the ones which show the largest improvement, for the reasons already discussed.

5.2. How have athletes' characteristics changed through history?

This is the analysis for the results in section 4.2. Multiple events are shown on the same plot, and a best fit trend line is shown for each event.

5.2.1. Height

5.2.1.1. Track (Flat) Events

For the track (flat) events, both male and female athletes show a similar relationship between the event distance and athlete height.

  • 400m athletes are tallest
  • 200m/800m athletes are next and are similar in height to one another
  • 1500m athletes are next in height, closely followed by 100m athletes, and this order has reversed in relatively modern times.
  • The long distance athletes are the shortest (10000m athletes tending to be shorter than 5000m athletes).

All the female flat track athletes have got slightly taller, with 1500m showing the biggest increase.

For men, 400m, 200m and 800m athletes have tended to get taller. 100m athletes have got shorter. For the other distances, the data doesn't go back very far, so from the data available these remaining heights seem unchanged through history.

5.2.1.2. Steeplechase

There is insufficient data to draw any conclusions about the women's steeplechase. The men's event shows a clear increase in height.

5.2.1.3. Hurdles Events

Mean height increased for both women and men in all hurdles events. For men, the longer distance hurdlers (400m) tend to be taller than the shorter distances, but the reverse is true for women. The hurdlers tend to be slightly taller than athletes at their flat track equivalent distances.

5.2.1.4. Marathon

Male marathoners show a very slight tendency to increase in mean height over time, and female marathoners show no change (though with a small range of available data). The marathon runners of both genders have similar height to the 10000m track runners.

5.2.2. Weight

5.2.2.1. Track (Flat) Events

For both male and female athletes there is a relationship between weight and distance of the event.

  • Short distance athletes are heaviest and show a tendency to increase in mean weight through history.
  • Middle distance athletes are next in mean weight. 800m athletes have maintained the same mean weight through history. Male 1500m athletes have got slightly heavier, female 1500m athletes have got slightly lighter through history.
  • Long distance athletes are the lightest. The men's mean weight has tended to decrease through history, and for woment this has stayed fairly constant (based on limited data).

5.2.2.2. Steeplechase

There is insufficient data to draw any conclusions about the women's steeplechase. There is no change shown in mean weight of male athletes, and this is at a similar level to male steeplechase or 5000m runners.

5.2.2.3. Hurdles Events

Mean weight increased for both women and men in all hurdles events. For men, the short distance hurdlers tend to be heavier than the longer distance hurdlers, but the reverse is true for women. The hurdlers tend to be heavier than athletes at their flat track equivalent distances, except for for female 400m hurdlers/flat runners, which have similar mean weights.

5.2.2.4. Marathon

The male and female marathoners have similar weights to the equivalent 5000m and 10000m athletes. The mean weights for both remain fairly constant through history.

(Side note about the men's results (Graph 40): there is one suspicious-looking point for the 1896 Games, of an anomalously heavy athlete. This is indeed a valid data point, a competitor who weighed 106 kg. He finished 6th out of 17 athletes in the 1896 Olympic marathon, although sadly we do not know his time. He must have been quite an unusual long distance athlete with this weight. Source: https://en.wikipedia.org/wiki/Dimitrios_Deligiannis.)

5.2.3. Age

For every event and gender, mean age of the competitors has increased through history.

5.2.3.1. Track (Flat) Events

For both genders, there is a general trend that runners of the longer distances tend to have a higher age than those of the shorter distances.

5.2.3.2. Steeplechase

There is insufficient data to draw any conclusions about the women's steeplechase. The mean age of the male competitiors is similar to that of the 5000m flat track athletes and has stayed fairly constant.

5.2.3.3. Hurdles Events

The mean ages for both distances of hurdling, and for both genders, are similar, and also similar to the equivalent flat track distances.

5.2.3.4. Marathon

The mena age of marathon runners of both genders is slightly higher than that of any of the long distance track athletes.

5.2.4. Body Mass Index (BMI)

5.2.4.1. Track (Flat) Events

There is a very clear link between the distance of the race and mean athlete BMI. The shorter the distance, the higher the BMI. This is true for both genders. In addition, Graphs 49 and 50 show an approximately diverging set of lines, meaning the trend is for the mean BMI of shorter distance athletes tends to increase through history, whilst it tends to decrease for the longer distance events. The separation between events where BMI is increasing an decreasing comes between 400m and 800m.

5.2.4.2. Steeplechase

There is insufficient data to draw any conclusions about the women's steeplechase. For men, mean BMI is tending to decrease through history, and is similar to the mean BMI of 5000m athletes.

5.2.4.3. Hurdles Events

Mean BMI has tended to increase for all events and genders except male 400m hurdles, where it has stayed fairly constant. Female 100m hurdlers have similar mean BMI to 100m flate track females, but female 400m hurdlers have a lower BMI than 400m track flat females. Male hurdlers have very similar man BMI to the euivalent track flat males.

5.2.4.4. Marathon

Mean BMIs have tended to show a slight decrease through history. The mean BMIs of marathoners of both genders are similar, or slightly higher than the mean BMI of 10000m and 5000m track runners.

(Side note: Graph 56 shows the same anomalous (but real) data point for the result in 1896. See the explanation in section 5.2.2.4..)

This is the analysis for the results in section 4.3.

5.3.1. Graphs 57 - 144: Athlete Characteristics vs Result Time per Event - All History

The previous section has already shown that athletes from a particular event tend to have common characterisitics (for example: the shorter the distance, the higher the mean BMI). This set of graphs looks more closely at Olympic athletes for a particular event (rather than across the whole population of thletes in all running events). The iam is to look for any relationship between any of the four athlete characteristics and their performance within these very select groups. Just for example: do we find that of all the athletes entere for the 100m, the heaviest athlete has a higher chance of winning? This is just a possibility to illutrate the kind of relationships we're looking for.

For height, most of the results tend to cluster near the mean height (bell curve shape). This includes both good and less good results, medal winners and non-medal winners, so really this is just reflecting the distribution of heights of the athletes for a particular event - in other words, there are more athletes with heights close to the mean for that particular event.

Results versus weights also often clusters near the mean in a bell curve shape. There are some cases where the curve is more skewed. For example Graph 80 shows more results above the mean value than below, but fewer and more consistently good results above the mean value of weight. This reflects the fact observed in the previous section that short distance athletes have tended to get heavier through history, and this has accompanied the improvement in performance.

For age, the results also tend to cluster about the mean. However, as with weight, the results sometimes follow a more skewed distribution, with the largest number of results below the mean age, and more consistently good results at higher age. This is clearer in the cases where more data points are available. This also partly reflects the changes in athlete population: section 5.2.3. noted that athletes at all events have tended to get older through history. We can observe that as mean athlete age has increased, athlete performance has also improved.

The BMI graphs have slightly different distributions for short distance, middle distance and long distance events.

  • Short distances tend to be skewed so that there are more results at lower BMI, and fewer but more consistently good results at higher BMI. Graph 124 shows this well.
  • Middle distances have an even distribution about the centre (not skewed). Graph 131 shows this well.
  • Long distances tend to be skewed so that there are more results at higher BMI, and fewer but more consistently good results at lower BMI. Graph 134 shows this well.

This is consistent with section 5.2.4.) which showed that BMIs for short distances have tended to increase, for middle distances they have stayed fairly constant, and for long distances they have tended to decrease. This has accompanied an improvement in results for all events.

In summary, these graphs all tend to show a normal distribution or skewed normal distribution. Beyond that, there is no clear trend in the time versus characteristic graphs, when we focus narrowly on the group of athletes competing in a particular event.

5.3.2. Graphs 145 - 240: Athlete Characteristics vs Result Time per Event - Grouped in 20 Year Periods

This analysis aims to isolate the effect of the four athlete characteristics and see what relationship they have with performance. However, it's obvious that there are many other factors that have changed through history that could also affect performance. For example, general healh and nutrition has improved and training facilities have got better. To mitigate these effects, graphs 145-232 plot the same results as in graphs 57-144, but using colour coding to show when those performances were. This allows an easier comparison between results, and in particular, the most modern results. These graphs all show the mean and standard deviation of the characteristic being plotted for athletes in the period 1996-2016 (rather than for all athletes in the results set).

The results from 1996-2016 occupy a much narrower band of times, from the minimum to the maximum time, than the full results set. It's clear that the best performances in the 1996-2016 period are spread across a fairly wide range of values of each characteristic. As with graphs 57-144, there is no clear trend in these graphs. Top performances come from athletes within 1-2 standard deviations of the mean of any particular characteristic.

5.3.3. Comparing Variation in Athlete Characteristics

Graphs 167, 168, 191, 192, 215, 216, 239 and 240 show the relationship between the mean and standard deviation of the characteristic for the athletes competing between 1996 and 2016 for each event. The previous sections (5.3.1. and 5.3.2) showed that athletes within 1-2 standard deviations of the mean for Olympic athletes in that event have the potential to deliver top performances. These graphs are another way fo examining how critical each characteristic is for each event.

5.3.3.1. Height

For both men and women, the mean heights in a significant number of events fall within one standard deviation of one another. Competitors in 400m flat and both hurdles events tend to have a greater height, the long distance events have the least height and the remaining events fall between and have significant overlap. A relatively short Olympic sprinter would be a similar height to a relatively tall Olympic marathon runner. The range of heights for women's 100m hurrdles looks anomalously low, though this could be due to the relatively small amount of data for female athletes in this event.

5.3.3.2. Weight

Mean weight for athletes in different events are more widely separated. The mean weight of the long distance runners is well separated from that of the short distance sprinters and does not even overlap much with the mean weights of middle distance athletes. This distinction is clear for both genders but for men in particular. So an athlete with a weight near that of a sprinter is unlikley to perform well in a marathon.

5.3.3.3. Age

For both men and women, the standard deviation age ranges overlap for every event.

5.3.3.4. BMI

The distribtion is similar to that of mean weights. The mean weights are quite well separated, with the longer distance athletes having lower BMI, and short distance athletes having a higher BMI. A sprinter's BMI is not suitable for a marathon runner, and vice versa. Female hurdlers are an interesting case. Female 100m hurdlers tend to be shorter than female 400m hurdlers, but they have a noticeably higher BMI.

6. Conclusion

6.1. Performance Improvments

Performances have improved across all events and genders. This reflects improvements in training techniques, general health, facilities and the application of science to sport. The rate of improvement has slowed down since about 1960, which is probably due to diminishing returns as ahletes and coaches get better.

Men's performances have generally improved in the range of 20-30%. Women's performances have generally improved in the range 5-20%. However, this difference stems largely from the fact that women have only be able to compete at all distances for a relatively short time, mostly after 1960, which is when the rate of improvement has generally been slower in all events. When there is a comparable data history for both men and women in the same event (e.g., half marathon) then the improvements in both ethe men's and women's events are similar.

6.2. Athlete Characteristics

  • Each event has its own different value for mean height, weight, age and BMI. This combination is like a "fingerprint" that characterises athletes that compete in that event.
  • There is a clear relationship between weight and distance: athletes competing in shorter distances are heaviest and have increased in weight through history, whilst athletes competing in longer distances are lightest and have tended to decrease in weight through history. This makes sense: short distance athletes rely on explosive power for a short time, whereas long distance athletes don't need high instantaneous power, but instead need to efficiently maintain a steady workrate for a long period.
  • Competitors in shorter distances have higher BMI than competitors at longer distance. This makes sense for the same reasons as weight.
  • The mean age of all competitors has increased through history. There are several possible reasons for this. One possibility is that athletes have moved from amateur to professional status, so they no longer need to cut short their athletic careers in order to go and earn a living. Another possibility is better ongoing care for athletes and recovery from injury, increasing their longevity. It also makes sense that the more years of training an athlete can complete, then the better they will get, at least up until some threshold where they may go through increasing physical decline.

6.3. Relating Athlete Characteristics to Performance

It's clear that athlete characteristics are important in determining which event an athlete will compete in: an athlete with a particular combination of height, weight, age and BMI (by implication) will be suited to particular events and less suited to others. This shows there is a relationship between athlete characteristics and performance.

This analysis also looked at single events in isolation and looked at the athletes competing in that event. It showed that although the characteristics of athletes for a particular event cluster about a mean, and the combination is characteristic of that event, athletes with a wide range of individual height, weight, BMI and age still produce top performances. The top performances do not cluster very closely to the mean - some top performances fall 1-2 standard deviations away from the mean.

The practical applications of these results for athletes and coaches are:

  • Each event has a combination of height, weight, age and BMI which is a "sweet spot" where athletes are more likely to be successful. For example, the BMI for a successful sprinter is higher than for a successful marathon runner. Knowing this could influence an athlete's training, for example they can perform specific training to gain or lose weight, if necessary. It is also useful for athlete and coach to determine which events an athlete is likely to perform well in - if high performance is the aim, then it makes sense to choose the event based on the characteristics of the athlete.
  • For high performance in a particular event, an athlete would ideally have a height, weight, age and BMI somewhere close to the mean value of these characteristics for the Olympic athletes in that event. However, the range of values exhibited by top performing athletes is still quite wide, and for some events, these ranges overlap. Some extremely highly performing athletes in the modern Olympic Games have characteristics with values far away (>2 standard deviations) from the mean. Usain Bolt is a great example of such an athlete - unusually tall and heavy for a sprinter, yet the world record holder.
  • Of all the characteristics examined, weight seems to be the most important factor in determining athlete performance. This is because graphs 191 and 192 show that the mean weights of athletes in different events are well separated from one another. There is some commonality for certain events - weight ranges for sprint events overlap, and weight ranges for long distance events overlap each other, but the long and short distance events definitely don't overlap. Middle distance events fall into their own group between the two. BMI follows a similar pattern because it is a function of weight. The fact that weight is the characteristic that can be trained for (gaining or losing weight) also makes this a useful practical observation.
  • Within the (admittedly narrow) range in this data set, age seems to have the least influence over performance in a particular event. It should be noted, though, that this is data for Olympic athletes, rather than the general poulation, so this data set is a narrow range.
  • Longer time spent training produces better athletes, and this is demonstrated by the increasing age of competitors at all events.
  • Performances are still improving for both genders. The data shows diminishing returns but no sign of the improvements coming to an end.

6.4. Further Work

Many factors influence athletes' performance, and this study has looked at just a few of them. A more extensive study would examine the effect of factors such as training schedules (e.g., distance run per week, techniques such as weight training), diet, and athlete characteristics such as VO2 max.

7. Deployment

This analysis is published on GitHub (https://github.com/mattjezza/ds-proj1-t2-elite-athletics) and summarised in a post on Medium titled "Analysis of Elite Running Performance using Historical Data".